WikiCite 2017

(Firefox asked me to rate it this morning, with a little picture of a broken heart and five stars to select from. I gave it five (’cause it’s brilliant) and then it sent me to a survey on mozilla.com titled “Heavy User V2”, which sounds like the name of a confused interplanetary supply ship.)

Today WikiCite17 begins. Three days of talking and hacking about the galaxy that comprises Wikipedia, Wikidata, Wikisource, citations, and all bibliographic data. There are lots of different ways into this topic, and I’m focusing not on Wikipedia citations (which I think is the main focus of the conference), but on getting (English) Wikisource metadata a tiny bit further along (e.g. figuring out how to display work details on a Wikisource edition page); and on a little side project of adding a Wikidata-backed citation system to WordPress.

The former is currently stalled on me not understanding the details of P629 ‘edition or translation of’ — specifically whether it should be allowed to have multiple values.

The latter is rolling on quite well: I’ve got it searching and displaying ‘book’ records on Wikidata, and the beginnings of updating them. Soon it shall be able to make lists of items, and insert the lists (or individual citations of items on them) into blog posts and pages. I’m not sure what the state of the art is for citation-formatting packages in PHP, but I’m hoping there’s something good out there.
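Searching Wikidata from PHP is essentially a call to the API’s wbsearchentities module. Here’s a minimal sketch of that kind of lookup (not the plugin’s actual code; the function name is mine), using WordPress’s own HTTP functions:

<?php
/**
 * Search Wikidata for items matching a term (e.g. a book title).
 * A rough sketch only; the real thing needs caching, error reporting, etc.
 */
function wsc_search_wikidata( $term ) {
    $url = 'https://www.wikidata.org/w/api.php?' . http_build_query( [
        'action'   => 'wbsearchentities',
        'search'   => $term,
        'language' => 'en',
        'type'     => 'item',
        'format'   => 'json',
    ] );
    $response = wp_remote_get( $url );
    if ( is_wp_error( $response ) ) {
        return [];
    }
    $data = json_decode( wp_remote_retrieve_body( $response ), true );
    $results = [];
    foreach ( $data['search'] ?? [] as $item ) {
        $results[] = [
            'id'          => $item['id'],
            'label'       => $item['label'] ?? '',
            'description' => $item['description'] ?? '',
        ];
    }
    return $results;
}

Each result’s ID can then be used to fetch the full item (for display, or for building a citation) via the wbgetentities module.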

And here is a scary chicken I saw yesterday at the Naturhistorisches Museum:

Scary chicken (Deinonychus antirrhopus)

MediaWiki Documentation Day 2017

It’s MediaWiki Documentation Day 2017!

So I’ve been documenting a couple of things, and I’ve added a bit to the Xtools manual.

The latter is actually really useful: not so much for the end user, because I dare say they’ll never read it, but because I always like writing documentation before coding. It makes the goal much clearer in my mind, and then the coding is much easier. With agreed-upon documentation, writing tests is easier; with tests written, writing the code is easier.

Time for a beer — and I’ll drink to DFD (document-first development)! Oh, and semantic linebreaks are great.

Editing MediaWiki pages in an external editor

I’ve been working on a MediaWiki gadget lately, for editing Wikisource authors’ metadata without leaving the author page. It’s fun working with and learning more about OOjs UI, but it’s also a pain, because gadget code is kept in JavaScript pages in the MediaWiki namespace, and so every single time you want to change something it’s a matter of saving the whole page, then clicking ‘edit’ again, and scrolling back down to find the spot you were at. The other end of things—the re-loading of whatever test page is running the gadget—is annoying and slow enough, without having to do much the same thing at the source end too.

So I’ve added a feature to the ExternalArticles extension that allows a whole directory full of text files to be imported at once (namespaces are handled as subdirectories). More importantly, it also ‘watches’ the directories, and every time a file is updated (e.g. with Ctrl-S in a text editor or IDE) it is re-imported. This means I can have MediaWiki:Gadget-Author.js and MediaWiki:Gadget-Author.css open in PhpStorm, and just edit from there. I even have these files open inside a MediaWiki project, so autocompletion and documentation look-up work as usual for all the library code. It’s even quite a speedy set-up, luckily: I haven’t yet noticed having to wait at any time between saving some code, alt-tabbing to the browser, and hitting F5.
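The watching itself is nothing clever. The sketch below shows the general idea (this isn’t the extension’s actual code: it just polls file modification times and pipes changed files into MediaWiki’s standard maintenance/edit.php script, with subdirectories mapping to namespaces as described above):

<?php
// Rough sketch of a watcher that (re-)imports text files as wiki pages.
// A file such as pages/MediaWiki/Gadget-Author.js becomes the page
// [[MediaWiki:Gadget-Author.js]]; files in the top level become main-namespace pages.

$baseDir = __DIR__ . '/pages';
$editScript = '/path/to/mediawiki/maintenance/edit.php'; // Adjust to suit.
$lastModified = [];

while ( true ) {
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator( $baseDir, FilesystemIterator::SKIP_DOTS )
    );
    foreach ( $files as $file ) {
        $path = $file->getPathname();
        $mtime = $file->getMTime();
        if ( isset( $lastModified[$path] ) && $lastModified[$path] === $mtime ) {
            continue; // Unchanged since the last pass.
        }
        $lastModified[$path] = $mtime;
        // 'pages/MediaWiki/Gadget-Author.js' becomes 'MediaWiki:Gadget-Author.js'.
        $relative = substr( $path, strlen( $baseDir ) + 1 );
        $title = str_replace( DIRECTORY_SEPARATOR, ':', $relative );
        // Pipe the file's contents into the standard edit.php maintenance script.
        shell_exec(
            'php ' . escapeshellarg( $editScript )
            . ' --summary ' . escapeshellarg( 'Imported from ' . $relative )
            . ' ' . escapeshellarg( $title )
            . ' < ' . escapeshellarg( $path )
        );
    }
    sleep( 1 );
}

The first pass through imports everything it finds, which conveniently covers the bulk-import case as well.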

I dare say my bodged-together script has many flaws, but it’s working for me for now!

Wikisource Hangout

I wonder how long it takes, after someone first starts editing a Wikimedia project, for them to figure out that they can read lots of Wikimedia news on https://en.planet.wikimedia.org/ — and when, after that, they realise they can also post news there? (At which point they probably give up if they haven’t already got a blog.)

Anyway, I forgot that I can post news, but then I remembered. So:

There’s going to be a Wikisource meeting next weekend (28 January, on Google Hangouts), if you’re interested in joining:
https://meta.wikimedia.org/wiki/Wikisource_Community_User_Group/January_2017_Hangout

My dream job

So I’ve started a new job: I’m now working for the Wikimedia Foundation in the Community Tech team. It’s really quite amazing, actually: I go to “work” and do things that I really quite like doing and would be attempting to find time to do anyway if I were employed elsewhere. Not that I’m really into the swing of things yet—only two weeks in—but so far it’s pretty great.

I’m really excited about being part of an organisation that actually means something.

Imagine a world in which every single human being can freely share in the sum of all knowledge. That’s our commitment.

It’s a bit cheesy to quote that, I know, but still: how nice it is to think that there’s something higher up the orgchart than an ever-increasing concentration of money.

Penguin Classics portal on Wikisource

I’ve made a start on a system to pull data from Wikidata and generate a portal for the Penguin Classics, with appropriate links for those that are on Wikisource or are ready to be transcribed.

I’m a bit of a SPARQL newbie, so perhaps this could’ve been done in a single query. However, I’m doing it in two stages: first, gathering all the ‘works’ that have at least one edition published by Penguin Classics, and then finding all editions of each of those works and seeing whether any of them are on Wikisource. Oh, and including the ones that aren’t, too!
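Roughly, the two stages look like this (a sketch only, not the actual portal code; the helper function is made up, but the endpoint, the properties, and the schema:about pattern for finding Wikisource pages are the standard WDQS ones):

<?php
// Sketch of the two-stage approach for the Penguin Classics portal.

/** Run a SPARQL query against the Wikidata Query Service and return the result bindings. */
function wdqs_query( $sparql ) {
    $url = 'https://query.wikidata.org/sparql?' . http_build_query( [
        'query'  => $sparql,
        'format' => 'json',
    ] );
    $context = stream_context_create( [ 'http' => [
        'header' => "User-Agent: PenguinClassicsPortal/0.1 (example)\r\n",
    ] ] );
    $json = json_decode( file_get_contents( $url, false, $context ), true );
    return $json['results']['bindings'];
}

// Stage 1: every work with at least one edition published by Penguin Classics (Q11281443).
$works = wdqs_query( '
    SELECT DISTINCT ?work WHERE {
        ?edition wdt:P629 ?work ;        # edition or translation of
                 wdt:P123 wd:Q11281443 . # publisher: Penguin Classics
    }
' );

// Stage 2: for each work, all of its editions, plus any English Wikisource page about them.
foreach ( $works as $workRow ) {
    $editions = wdqs_query( '
        SELECT ?edition ?wikisourcePage WHERE {
            ?edition wdt:P629 <' . $workRow['work']['value'] . '> .
            OPTIONAL {
                ?wikisourcePage schema:about ?edition ;
                                schema:isPartOf <https://en.wikisource.org/> .
            }
        }
    ' );
    // ... build the portal entry, keeping the editions with no Wikisource page
    // (those are the ones still waiting to be transcribed). The real script also
    // looks for links on the work itself, as described below.
}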

Wikidata:WikiProject Books sort of uses the FRBR model, primarily to represent books and editions (‘editions’ being a combination of the manifestation and expression levels of FRBR; i.e. an edition realises and embodies a work). So most of the metadata we want exists at the ‘work’ level: title, author, date of first publication, genre, etc.

At the ‘edition’ level we look for a link to Wikisource (because a main-namespace item on Wikisource is an edition… although this gets messy; see below), and a link to the edition’s transcription project. Actually, we also look for these on the work itself, because often Wikidata has these properties there instead or as well — which is wrong.

Strictly speaking, the work metadata shouldn’t have anything about where the work is on Wikisource (either mainspace or Index file). The problem with adhering to this, however, is that by doing so we break interwiki links from Wikisource to Wikipedia. A Wikipedia article is (almost always) about a work, and we want to link a top-level Wikisource mainspace page to this work… but the existing systems for doing this don’t allow for the intermediate step of going from Wikisource to the edition, then to the work, and then to Wikipedia.

So for now, my scruffy little script looks for project links at both levels, and seems to do so successfully.

The main problem now is that there’s just not much data about these books on Wikidata! I’ll get working on that next…

Penguin Classics on Wikisource

As a way of learning SPARQL and more about Wikidata, I’m trying to make a list of which pre-1924 Penguin Classics are on Wikisource.

Penguin lists their books at penguin.com.au/browse/by-imprint/penguin-classics.

The following Wikidata Query Service query lists all editions published by Penguin, their date of original publication, and whether there’s an Index page on Wikisource for the work or edition.

SELECT ?edition ?editionLabel ?work ?workLabel ?originalPublicationDate ?wikisourceIndexForWork ?wikisourceIndexForEdition
WHERE
{
  ?edition wdt:P31 wd:Q3331189 .
  ?edition wdt:P577 ?publicationDate .
  ?edition wdt:P123 ?publisher .
  FILTER(
    ?publisher = wd:Q1336200 # Penguin Books Q1336200
    || ?publisher = wd:Q11281443 # Penguin Classics Q11281443
  )
  ?edition wdt:P629 ?work .
  OPTIONAL{ ?work wdt:P577 ?originalPublicationDate } .
  OPTIONAL{ ?work wdt:P1957 ?wikisourceIndexForWork } .
  OPTIONAL{ ?edition wdt:P1957 ?wikisourceIndexForEdition } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

I’m not sure how often the WDQS data is updated, but so far it’s not being very useful for on-the-fly checking of recent updates. I’m sure there’s a better way of doing that, though.

Nyunga words on Wiktionary

I’ve pretty much finished moving a set of ‘template’ Nyunga-language Wiktionary entries into my userspace on Wiktionary, from where they can be copied into mainspace. There are a few dramas with differing character sets between definitions in some of the word lists I’ve got, so a couple of letters are missing. There are plenty there, though, and mainly I’m interested now to see whether this idea of copying, pasting, and then copy-editing these entries is going to be a sensible workflow.

I thought about bulk-importing these directly into place, but the problem with that (quite apart from the fact that none of these wordlists have machine-readable part-of-speech data) is that almost all of them are going to need cleaning up and improving. For example, “kabain nin nana kulert” is in there as an entry. It means “perhaps someone ate it and went away”, and (I’m guessing) isn’t an idiom, so really oughtn’t have its own entry. It can, however, be used as a citation in every single one of its constituent words. That’s something I think is best left up to a human, rather than forcing a human to clean up a bot’s mistakes. Or take “tandaban”, which has a definition of “jump, to [9]” (the square-bracket references are throughout this dataset and are not explained anywhere that I’ve been able to find). This should just be translated as “jump”, with a link to the English verb; again, a script could handle that, but the myriad of incoming formats would take too much time to code.

Maybe I’m just not being clever enough about preparing the data, and an import script, in a rich enough way. But that could take ages before this data ever sees the light of day on Wiktionary; the approach I’ve used means that it’s there now for anyone who wants to work with it. There are also so very many improvements that a human editor can make along the way that we’ll end up with better data for fewer words… and that seems to be the correct trade-off. Wiktionary is a ‘forever’ project, after all!

Of course, the plan is to be able to extract the data after it’s been put in its proper place, and I’ve started work on a PHP library for doing just that. I’d rather do the code-work on that end of it, and put in the time for a human-mediated import at the beginning.
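As a very rough sketch of what that extraction involves (the function name is made up, and the exact level-two language heading used on Wiktionary for Nyunga would need checking), the library essentially fetches a page’s wikitext through the standard action API and pulls out the relevant language section:

<?php
// Sketch: fetch a Wiktionary entry's wikitext and extract one language's section.
// Assumes the section heading is '==Nyunga==' — check against the real entries.

function fetch_language_section( $title, $language = 'Nyunga' ) {
    $url = 'https://en.wiktionary.org/w/api.php?' . http_build_query( [
        'action' => 'parse',
        'page'   => $title,
        'prop'   => 'wikitext',
        'format' => 'json',
    ] );
    $response = json_decode( file_get_contents( $url ), true );
    $wikitext = $response['parse']['wikitext']['*'] ?? '';

    // A level-2 heading starts a language section, which runs until the next level-2 heading.
    $pattern = '/^==\s*' . preg_quote( $language, '/' )
        . '\s*==[ \t]*\n(.*?)(?=^==[^=]|\z)/ms';
    if ( preg_match( $pattern, $wikitext, $matches ) ) {
        return trim( $matches[1] );
    }
    return null;
}

// e.g. fetch_language_section( 'tandaban' ), once that entry is in mainspace.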

All of this is a long-winded way of putting out there on the web, in this tiny way, an invitation for anyone to come and help see if this import is going to work at all! Will you help?