Nyunga words on Wiktionary

I’ve pretty much finished moving a set of ‘template’ Nyunga-language Wiktionary entries into my userspace on Wiktionary, from where they can be copied into mainspace. There are a few dramas with differing character-sets between definitions in some of the word lists I’ve got, so a couple of letters are missing. There’s plenty that are there though, and mainly I’m interested now to see if this idea of copying, pasting, and then copy-editing these entries is going to be a sensible workflow.

I thought about bulk importing these directly into place, but the problem with that is (quite apart from the fact that none of these wordlists have machine-readable part-of-speech data) that almost all of them are going to need cleaning up and improving. For example, “kabain nin nana kulert” is in there as an entry. It means “perhaps someone ate it and went away”, and (I’m guessing) isn’t an idiom and so really oughtn’t have its own entry. It can however be used as a citation in every single one of its constituent words. That’s something that I think is best left up to a human, rather than forcing a human to clean up a bot’s mistakes. Or take “tandaban” which has a definition of “jump, to [9]” (and the square bracket references are throughout this dataset and are not explained anywhere that I’ve been able to find). This should just be translated as “jump” with a link to the English verb; again, a script could handle that, but the myriad of incoming formats would take too much time to code.
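To give a feel for what “a script could handle that” means for the one “tandaban” case, here is a minimal sketch in Python. The function names are made up for illustration, and it only understands the single “word, to [n]” shape quoted above; every other format in the wordlists would need its own rule, which is rather the point.

```python
import re

def clean_definition(raw):
    """Split one raw definition into (gloss, is_verb, reference).

    Hypothetical helper: it only knows the "headword, to [n]" shape.
    """
    # Pull a trailing square-bracket reference such as "[9]" off the end.
    ref_match = re.search(r"\[(\d+)\]\s*$", raw)
    reference = ref_match.group(1) if ref_match else None
    text = re.sub(r"\s*\[\d+\]\s*$", "", raw).strip()

    # "jump, to" is an inverted infinitive; keep only the bare verb.
    verb_match = re.match(r"^(.+?),\s*to$", text)
    if verb_match:
        return verb_match.group(1).strip(), True, reference
    return text, False, reference

def as_wikilink(gloss):
    """Wrap a gloss in ordinary wikitext link brackets."""
    return "[[" + gloss + "]]"

gloss, is_verb, reference = clean_definition("jump, to [9]")
print(gloss, is_verb, reference)  # jump True 9
print(as_wikilink(gloss))         # [[jump]]
```

Anything that doesn’t match comes back unchanged, so a human still gets to look at it.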

Maybe I’m just not being clever enough about preparing the data, and an import script, in a rich enough way. But that could take ages before ever this data sees the light of day on Wiktionary; the approach I’ve used means that it’s there now for anyone who wants to work with it. There are also so very many improvements that a human editor can make along the way, that it seems we’ll have better data for fewer words… and that seems to be the correct trade-off. Wiktionary is a ‘forever’ project, after all!

Of course, the plan is to be able to extract the data after it’s been put in its proper place, and I’ve started work on a PHP library for doing just that. I’d rather do the code-work on that end of it, and put in the time for a human-mediated import at the beginning end.
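As a rough idea of what the extraction end might look like (this is not the PHP library itself, just a Python sketch), the following pulls definition lines out of one language section of an entry’s wikitext. It assumes the usual Wiktionary layout of a level-two language heading, level-three part-of-speech headings, and definitions on lines starting with “#”. The sample entry text and the function name are invented for the example.

```python
import re

# Invented sample wikitext, standing in for a real entry.
SAMPLE = """==Nyunga==

===Verb===

# to [[jump]]

==English==

===Noun===

# something else
"""

def extract_definitions(wikitext, language):
    """Return {part_of_speech: [definitions]} for one language section."""
    definitions = {}
    in_language = False
    part_of_speech = None
    for line in wikitext.splitlines():
        level2 = re.match(r"^==([^=].*?)==\s*$", line)
        level3 = re.match(r"^===([^=].*?)===\s*$", line)
        definition = re.match(r"^#\s+(.*)", line)
        if level2:
            # A new language section starts; forget the old part of speech.
            in_language = level2.group(1).strip() == language
            part_of_speech = None
        elif level3:
            part_of_speech = level3.group(1).strip()
        elif definition and in_language and part_of_speech:
            # Reduce [[target]] and [[target|label]] links to plain text.
            text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", definition.group(1))
            definitions.setdefault(part_of_speech, []).append(text.strip())
    return definitions

print(extract_definitions(SAMPLE, "Nyunga"))  # {'Verb': ['to jump']}
```

Real entries carry templates, quotations and nested sections that this ignores entirely.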

All of this is a long-winded way of putting out there on the web, in this tiny way, an invitation for anyone to come and help see if this import is going to work at all! Will you help?

Drupal terminology vs old-fashioned DB words

Drupal’s entity model is pretty confusing if one is used to the strict world of ‘proper’ RDBMSs, but it does start to make sense after a while. It’s best to just forget about all the cruft that comes with the standard installation, and work with the base functionality (and some that’s currently provided by modules but looks set to be in D8 core).

Taxonomy, for instance, provides a subset of the functionality that can be built with entity references (a.k.a. foreign keys in the relational model, except there’s no integrity!).
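For anyone who hasn’t been bitten by it, here is a small sketch of what “no integrity” means, using Python’s built-in sqlite3 module rather than anything from Drupal. The author and article tables are invented for the example. With foreign key checking switched on the database refuses a row that points at nothing; with it switched off the dangling reference goes straight in, which is the situation I’m comparing entity references to.

```python
import sqlite3

def try_dangling_insert(enforce):
    """Insert an article whose author_id matches no author.

    Returns True if the database accepted the row, False if it refused.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite only checks foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON" if enforce else "PRAGMA foreign_keys = OFF")
    conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE article ("
        "id INTEGER PRIMARY KEY, title TEXT, "
        "author_id INTEGER REFERENCES author(id))"
    )
    try:
        # There is no author with id 42, so this reference dangles.
        conn.execute("INSERT INTO article (title, author_id) VALUES ('Orphan', 42)")
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()

print(try_dangling_insert(enforce=True))   # False
print(try_dangling_insert(enforce=False))  # True
```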

The commenting system (which is also in core) can be constructed with basic Drupal things like content types, views, blocks, and rules.

Same goes for Book pages. Everything, really.

(At least, this is my current thesis; an attempt to reduce the number of modules I have to get my head around to under a thousand…)

Basically one needs to just know of the following:

Database term         Drupal equivalent (sort of)
Table                 Content (or Node) Type
Row                   Node
Column                Field
Foreign key           Entity reference field
View                  View
Enum field type       list_text field type
Boolean field type    list_boolean field type

Don’t Write Code (write descriptions of things)

I wish I didn’t know how to code.

For a programmer, the solution to every problem is to write more code.

But sometimes, all that is needed is to write proper words. To explain things and explore them through prose.

Not to remove oneself to the meta-realm of trying to understand the general structure of the problem and model it accordingly. (And then build something that resembles that model, and hope that the people using it see through the layers back to what the buggery’s trying to be done!)

Just write some nice, verbose, rambling blather about what it is and how it works and where we’re trying to go from here. Nothing too technical, and hopefully actually interesting to read. At least, linear, in that old-fashioned way of real writing. Interesting is probably too much to aim for… just words, then.

I was reading Phoebe Ayers’ recent post about the task of archiving the Wikimedia Foundation’s material. My first thought was “what sort of database/catalogue would be useful for this sort of thing?” Which is quite the wrong question, of course. There’s a whole world of wikis (both instances and engines) out there, perfect for this sort of variably-structured data. (If there’s one thing that constantly amazes me about Wikipedia it’s the fact that so much structure and repeated data is contained in what is basically an immense flat list of lone text files, and that it does rather work! The database geek in me shudders.)

I think a basic tenet for archiving physical and digital resources is that each object, and each grouping of objects, needs to have its own web page. In most cases, I use this both as a catalogue entry for the object or group, and as a printable coversheet to store along with the physical objects (or, in the case of digital-only objects, to be a physical placeholder or archive copy, if they warrant it).

The other thing I try to stick to is that a fonds and its catalogue (i.e. a pile of folders/boxes and the website that indexes them and adds whatever other digital material to the mix) should be able to be shifted off to someone else to maintain! That not everything should live in the same system, nor require particularly technical skills to maintain.

I know that there’s a dozen formalised ways of doing this stuff, and I wish I knew the details of them more thoroughly! For now, I’ll hope that a non-structured catalogue can work, and continue to write little printable English-language wiki pages to collate in amongst my folders of polypropylene document sleeves. And I’ll keep checking back to en.wikibooks.org/wiki/Subject:Library_and_Information_Science for instructions on how to do it better…