My dream job

So I’ve started a new job: I’m now working for the Wikimedia Foundation in the Community Tech team. It’s really quite amazing, actually: I go to “work” and do things that I really quite like doing and would be attempting to find time to do anyway if I were employed elsewhere. Not that I’m really into the swing of things yet—only two weeks in—but so far it’s pretty great.

I’m really excited about being part of an organisation that actually means something.

Imagine a world in which every single human being can freely share in the sum of all knowledge. That’s our commitment.

It’s a bit cheesy to quote that, I know, but still: how nice it is to think that there’s something higher up the org chart than an ever-increasing concentration of money.

Penguin Classics portal on Wikisource

I’ve made a start on a system to pull data from Wikidata and generate a portal for the Penguin Classics, with appropriate links for those that are on Wikisource or are ready to be transcribed.

I’m a bit of a SPARQL newbie, so perhaps this could’ve been done in a single query. However, I’m doing it in two stages: first, gathering all the ‘works’ that have at least one edition published by Penguin Classics, and then finding all editions of each of those works and seeing if any of them are on Wikisource. Oh, and including the ones that aren’t, too!
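
The first stage looks something like the following (a minimal sketch only, not the actual script; it uses the same items and properties as the full query further down: Q3331189 for ‘version, edition or translation’, P123 for publisher, and P629 to link an edition to its work):

SELECT DISTINCT ?work ?workLabel
WHERE
{
  ?edition wdt:P31 wd:Q3331189 .   # instance of: version, edition or translation
  ?edition wdt:P123 wd:Q11281443 . # publisher: Penguin Classics
  ?edition wdt:P629 ?work .        # edition or translation of: the work
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}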

Wikidata:WikiProject Books sort of uses the FRBR model to represent primarily books and editions (‘editions’ being a combination of the manifestation and expression levels of FRBR; i.e. an edition realises and embodies a work). So most of the metadata we want exists at the ‘work’ level: title, author, date of first publication, genre, etc.

At the ‘edition’ level we look for a link to Wikisource (because a main-namespace page on Wikisource is an edition… although this gets messy; see below), and a link to the edition’s transcription project. Actually, we also look for these on the work itself, because Wikidata often has these properties there instead, or as well — which is wrong.

Strictly speaking, the work metadata shouldn’t have anything about where the work is on Wikisource (either mainspace or an Index file). The problem with adhering to this, however, is that by doing so we break interwiki links from Wikisource to Wikipedia. Because a Wikipedia article is (almost always) about a work, and we want to link top-level Wikisource mainspace pages to this work… and the existing systems for doing this don’t allow for the intermediate step of going from Wikisource to the edition, then to the work, and then to Wikipedia.

So for now, my scruffy little script looks for project links at both levels, and seems to do so successfully.
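
In query terms, that both-levels lookup amounts to something like this (again only a sketch of what the script does, not the script itself; it uses the standard schema:about sitelink pattern for Wikisource mainspace pages and the same P1957 index-page property as in the query below):

SELECT ?work ?edition ?wsPageForEdition ?wsPageForWork ?indexForEdition ?indexForWork
WHERE
{
  ?penguinEdition wdt:P123 wd:Q11281443 . # publisher: Penguin Classics
  ?penguinEdition wdt:P629 ?work .        # ...so this is one of our works
  ?edition wdt:P629 ?work .               # now take every edition of that work
  OPTIONAL { ?wsPageForEdition schema:about ?edition ; schema:isPartOf <https://en.wikisource.org/> }
  OPTIONAL { ?wsPageForWork schema:about ?work ; schema:isPartOf <https://en.wikisource.org/> }
  OPTIONAL { ?edition wdt:P1957 ?indexForEdition }  # transcription project, on the edition
  OPTIONAL { ?work wdt:P1957 ?indexForWork }        # transcription project, on the work
}

(In practice this would probably be combined with the first query, or limited to a handful of works at a time, to keep within the query service’s time limits.)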

The main problem now is that there’s just not much data about these books on Wikidata! I’ll get working on that next…

Penguin Classics on Wikisource

As a way of learning SPARQL and more about Wikidata, I’m trying to make a list of which pre-1924 Penguin Classics are on Wikisource.

Penguin lists their books at penguin.com.au/browse/by-imprint/penguin-classics.

The following Wikidata Query Service query lists all editions published by Penguin, their date of original publication, and whether there’s an Index page on Wikisource for the work or edition.

SELECT ?edition ?editionLabel ?work ?workLabel ?originalPublicationDate ?wikisourceIndexForWork ?wikisourceIndexForEdition
WHERE
{
  ?edition wdt:P31 wd:Q3331189 .       # instance of: version, edition or translation
  ?edition wdt:P577 ?publicationDate . # publication date (of the edition)
  ?edition wdt:P123 ?publisher .       # publisher
  FILTER(
    ?publisher = wd:Q1336200 # Penguin Books Q1336200
    || ?publisher = wd:Q11281443 # Penguin Classics Q11281443
  )
  ?edition wdt:P629 ?work .            # edition or translation of: the work
  OPTIONAL{ ?work wdt:P577 ?originalPublicationDate } .       # original publication date, on the work
  OPTIONAL{ ?work wdt:P1957 ?wikisourceIndexForWork } .       # Wikisource index page, recorded on the work
  OPTIONAL{ ?edition wdt:P1957 ?wikisourceIndexForEdition } . # Wikisource index page, recorded on the edition
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
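
Since the aim is a list of pre-1924 works, one possible refinement (untested here, and note that it would drop any work with no original publication date recorded) is to add a date filter inside the WHERE clause, after the OPTIONAL lines:

  FILTER(BOUND(?originalPublicationDate) && YEAR(?originalPublicationDate) < 1924)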

I’m not sure how often the WDQS data is updated, but so far it’s not being very useful for on-the-fly checking of recent updates. I’m sure there’s a better way of doing that though.

Nyunga words on Wiktionary

I’ve pretty much finished moving a set of ‘template’ Nyunga-language Wiktionary entries into my userspace on Wiktionary, from where they can be copied into mainspace. There are a few dramas with differing character sets between definitions in some of the word lists I’ve got, so a couple of letters are missing. There are plenty there, though, and mainly I’m interested now to see whether this idea of copying, pasting, and then copy-editing these entries is going to be a sensible workflow.

I thought about bulk importing these directly into place, but the problem with that is (quite apart from the fact that none of these wordlists have machine-readable part-of-speech data) that almost all of them are going to need cleaning up and improving. For example, “kabain nin nana kulert” is in there as an entry. It means “perhaps someone ate it and went away”, and (I’m guessing) isn’t an idiom and so really oughtn’t have its own entry. It can, however, be used as a citation in every single one of its constituent words. That’s something that I think is best left up to a human, rather than forcing a human to clean up a bot’s mistakes. Or take “tandaban”, which has a definition of “jump, to [9]” (the square-bracket references appear throughout this dataset and are not explained anywhere that I’ve been able to find). This should just be translated as “jump” with a link to the English verb; again, a script could handle that, but the myriad of incoming formats would take too much time to code.

Maybe I’m just not being clever enough about preparing the data (and an import script) in a rich enough way. But that could take ages before this data ever sees the light of day on Wiktionary; the approach I’ve used means that it’s there now for anyone who wants to work with it. There are also so very many improvements that a human editor can make along the way that it seems we’ll end up with better data for fewer words… and that seems to be the correct trade-off. Wiktionary is a ‘forever’ project, after all!

Of course, the plan is to be able to extract the data after it’s been put in its proper place, and I’ve started work on a PHP library for doing just that. I’d rather do the code work on that end of it, and put in the time for a human-mediated import at the beginning.

All of this is a long-winded way of putting out there on the web, in this tiny way, an invitation for anyone to come and help see if this import is going to work at all! Will you help?

2016 begins

It’s 2016 and it seems like a good time to attempt some new type of explanation of things. Things in general, I mean, and things internety. Or, maybe not ‘explanation’ so much as formless rambling. That’s easier on the brain, given the amount of sleep I’ve been getting (i.e. sod all).

I’m four days into the new working year, and some good bits of code are already shaping up (file attachment fields and schema-editing in Tabulate, hopefully both ready to roll before too much longer). Some odd bits of enterprise bureaucracy have nearly fallen on my head but for the most part missed me (whereupon I’ve attempted the old I-didn’t-see-anything trick, and carried on regardless).

I had a couple of weeks off, and explored some great bits of the south west. So nice to be back at Wilyabrup (not climbing, just looking, and some mapping). And I didn’t even take my GPS to Walpole; good to be not attempting to Record Everything for a while.

Things for this year, perhaps: Wikisource proofreading; importing Nyunga words into Wiktionary; carry on with Tabulate; print CFB at long last; go to Wikimania; try to write every day; get MoonMoon working again properly for Planet Freo. But mostly: stop re-evaluating everything and just get on with what’s (reasonably and probably not perfectly) good enough and worthwhile. Code less! Work on content and data more; code only what’s required.

Wikisource category browser now has other languages

I’ve updated the Wikisource validated works’ category browser tool to include other languages. So far it’s just Italian, and to some perhaps-incorrect extent French (there are only four? that’s not right).

I just need more Wikisources to tell me the names of their validated-works and root categories, and then it’ll be a matter of adding these to the config to get them running.

The category list is updated weekly.

Wikisource needs your input « Wikimedia blog

A new Wikisource survey is being conducted!

During the survey, you will be asked questions regarding your personal involvement with the Wikisource project, your preferences regarding governance and technology, and your opinion on how a Wikisource Conference should be shaped. With the support of Wikimedia Österreich and Wikimedia Italia, a Project and Event Grant proposal is to be presented for such a conference. We would like to involve Wikisourcers in a joint venture both to spread knowledge about the project and to strengthen community bonds. This…

Read more: Wikisource needs your input « Wikimedia blog

I now use curly quotation marks when proofreading

There are currently two things that are annoying me about Wikisource books: the inclusion of hyperlinks (to be all 1990s about it in using that word), and the use of straight quotation marks.

Links I can forgive, or even actively enjoy, in non-fiction; but in fiction, they have no place. (So think I, anyway.) Especially when they link to a sodding dictionary term! I know how to look up a word I don’t know. Sigh.

The curly-vs-straight argument is an odd one. We only have straight ones thanks to typewriters (or their manufacturers, I guess) not wanting to have two sorts for each type of quotation mark. So why we persist I cannot say! No, I can say… it’s mostly to do with ease of typing, on common systems, I think. It’s annoying to type the opening and the closing glyphs, when there’s only one button on the keyboard. But really! That might hold sway where there’s no automatic system for handling these things, but we have those systems and they work admirably. And certainly, when it comes to typesetting books that are going to be read by (we hope) very many people, it’s worth putting a bit more effort in to make them look nice.

Because that’s what it’s about, ultimately: making the text beautiful! For how many hundreds of years have people been taking terrific care over making books look nice?! Let’s not give up on that.

I’m not really sure why I’m writing this, today. (Probably due to the glass of White Rabbit I’ve just had.) It’s that I’m firing with the zeal of the converted! I am, you see. I used to not care about quotes, and think they should be left straight — now, I stand on speakers’ corner and holler to confused passersby!

So, would that ye enjoy yr ebooks?! Then set them with loveliness!

Right… where’s that beer…

Help archive Wikimedia Commons!

WikiTeam has released an update of the chronological archive of all Wikimedia Commons files, up to 2013. Now ~34 TB in total.

Just seed one or more of these torrents (typically 20-40 GB) and you’ll be like a brick in the Library of Alexandria (or something), doing your bit for permanent preservation of this massive archive.

From this post to wikimedia-l.

What goes Where on the Web

Every now and then I recap where and what I store online. Today I do so again, while rather feeling that there should be discrete and specific tools for each of these things.

Firstly there are the self-hosted items:

  1. WordPress for blogging (where photo and file attachments should be customised to the exact use in question, not linked from external sites). It’s also my OpenID provider.
  2. Piwigo as the primary location for all photographs.
  3. MoonMoon for feed reading (and, hopefully one day, archiving).
  4. MediaWiki for family history sites that are closed-access.
  5. My personal DokuWiki for things that need to be collaboratively edited.

Then the third-party hosts:

  1. OpenStreetMap for map data (GPX traces) and blogging about map-making.
  2. Wikimedia Commons for media of general interest.
  3. The NLA’s Trove for correcting newspaper texts.
  4. Wikisource as a library.
  5. Twitter (although I’m not really sure why I list this here at all).

Finally, I’m still trying to figure out the best system for:

  1. Public family history research. There’s some discussion about this on Meta.