Things should be ‘projects’, not ‘systems’. They should end, so they can be forgotten. They must be in a fit state to be ended and forgotten. Books work with that idea, but websites are trickier. That’s slightly annoying, but there are great tools for making it easier. Not as easy as sticking a book in a cupboard for a century though. Hmm. I think I need another beer….

Wikisource books for binding

I have been experimenting with turning Wikisource works into LaTeX-formatted bindable PDFs. My initial idea was to produce quarto or octavo layout sheets (i.e. 8 or 16 book pages to a sheet of paper that’s printed on both sides, with the pages laid out so that when the sheet is folded they are in the correct order) but now I’m thinking of just using a print-on-demand service (hopefully Pediapress, because they seem pretty brilliant).

Basically, my tool downloads all of a work’s pages and subpages (in the main namespace only; it doesn’t care about the method of construction of the work) and saves the HTML for these, in order, to a html/ directory. Then (here’s the crux of the thing) it uses Pandoc to create a set of matching TeX files in an adjacent latex/ directory.
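
The conversion step might look something like the following. This is only a minimal sketch, not the actual tool: the function names and the one-.tex-per-.html naming are assumptions, and it expects the pandoc executable to be on the PATH.

```python
import subprocess
from pathlib import Path


def pandoc_command(html_file: Path, latex_dir: Path) -> list[str]:
    """Build the Pandoc call that converts one HTML file to a matching .tex file."""
    # Assumption: 001.html in html/ becomes 001.tex in latex/.
    tex_file = latex_dir / (html_file.stem + ".tex")
    return ["pandoc", "--from=html", "--to=latex", "-o", str(tex_file), str(html_file)]


def convert_all(html_dir: Path, latex_dir: Path, run: bool = True) -> list[list[str]]:
    """Convert every *.html file, in sorted (i.e. reading) order."""
    latex_dir.mkdir(parents=True, exist_ok=True)
    commands = [pandoc_command(f, latex_dir) for f in sorted(html_dir.glob("*.html"))]
    if run:
        for command in commands:
            # check=True stops at the first file Pandoc cannot convert.
            subprocess.run(command, check=True)
    return commands
```

Calling convert_all(Path("html"), Path("latex")) from the root of a work's directory would then fill latex/ with one file per downloaded page.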

So far, so obvious. But the trouble with this approach of wanting to create a separate source format for a work is that there are changes that one wants to make to the work (either formatting or structural) that can’t be made upstream on Wikisource — but we also want to be able to bring down updates at any time from Wikisource. That is to say, this is creating a fork of the work in a different format, but it’s a fork that needs to be able to be kept up to date.

My current solution to this is to save the HTML and LaTeX files in a Git repository (one per work) and have two branches: one containing the raw un-edited HTML and LaTeX, on which the download operation can be re-run at any time; and the other being based off this, being a place to make any edits required, and which can have the first merged into it whenever that’s updated. This will sometimes result in merge conflicts, but for the most part (because the upstream changes are generally small typo fixes and the like) will happen without error.
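
As a rough sketch of that two-branch routine (the branch names "upstream" and "edited", the function name, and the download callback are all made up for illustration; only standard git subcommands are used):

```python
import subprocess

RAW_BRANCH = "upstream"  # hypothetical name: raw, un-edited HTML and LaTeX
EDIT_BRANCH = "edited"   # hypothetical name: local formatting and structural edits


def refresh(repo_dir, download, run=subprocess.run):
    """Re-download onto the raw branch, then merge it into the edited branch."""

    def git(*args, check=True):
        return run(["git", *args], cwd=repo_dir, check=check)

    git("checkout", RAW_BRANCH)
    download()  # re-fetch the HTML and regenerate the LaTeX
    git("add", "html", "latex")
    # `git diff --cached --quiet` exits non-zero only when something is staged,
    # so nothing is committed if Wikisource has not changed.
    if git("diff", "--cached", "--quiet", check=False).returncode != 0:
        git("commit", "-m", "Update from Wikisource")
    git("checkout", EDIT_BRANCH)
    # A merge conflict makes git exit non-zero, which raises CalledProcessError
    # here; the conflict then has to be resolved by hand.
    git("merge", RAW_BRANCH)
```

Passing the runner in as a parameter is just a convenience so that the sequence of commands can be checked without touching a real repository.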

Now I just want to automate all this a little bit more, so a new project can be created (with GitHub repo and all) with a single (albeit slow!) command.

The output ends up something like The Nether World by George Gissing.pdf.

WikiCite 2017

(Firefox asked me to rate it this morning, with a little picture of a broken heart and five stars to select from. I gave it five (’cause it’s brilliant) and then it sent me to a survey titled “Heavy User V2”, which sounds like the name of a confused interplanetary supply ship.)

Today WikiCite17 begins. Three days of talking and hacking about the galaxy that comprises Wikipedia, Wikidata, Wikisource, citations, and all bibliographic data. There are lots of different ways into this topic, and I’m focusing not on Wikipedia citations (which is the main drive of the conference, I think), but on getting (English) Wikisource metadata a tiny bit further along (e.g. figure out how to display work details on a Wikisource edition page); and on a little side project of adding a Wikidata-backed citation system to WordPress.

The former is currently stalled on me not understanding the details of P629 ‘edition or translation of’ — specifically whether it should be allowed to have multiple values.

The latter is rolling on quite well, and I’ve got it searching and displaying and the beginnings of updating ‘book’ records on Wikidata. Soon it shall be able to make lists of items, and insert the lists (or individual citations of items on them) into blog posts and pages. I’m not sure what the state of the art is in PHP of packages for formatting citations, but I’m hoping there’s something good out there.

And here is a scary chicken I saw yesterday at the Naturhistorisches Museum:

Scary chicken (Deinonychus antirrhopus)

Penguin Classics portal on Wikisource

I’ve made a start on a system to pull data from Wikidata and generate a portal for the Penguin Classics, with appropriate links for those that are on Wikisource or are ready to be transcribed.

I’m a bit of a SPARQL newbie, so perhaps this could’ve been done in a single query. However, I’m doing it in two stages: first, gathering all the ‘works’ that have at least one edition published by Penguin Classics, and then finding all editions of each of those works and seeing if any of them are on Wikisource. Oh, and including the ones that aren’t, too!

Wikidata:WikiProject Books sort of uses the FRBR model to represent primarily books and editions (‘editions’ being a combination of the manifestation and expression levels of FRBR; i.e. an edition realises and embodies a work). So most of the metadata we want exists at the ‘work’ level: title, author, date of first publication, genre, etc.

At the ‘edition’ level we look for a link to Wikisource (because a main-namespace item on Wikisource is an edition… although this gets messy; see below), and a link to the edition’s transcription project. Actually, we also look for these on the work itself, because often Wikidata has these properties there instead or as well — which is wrong.

Strictly speaking, the work metadata shouldn’t have anything about where the work is on Wikisource (either mainspace or Index file). The problem with adhering to this, however, is that by doing so we break interwiki links from Wikisource to Wikipedia. That’s because a Wikipedia article is (almost always) about a work, and we want to link a top-level Wikisource mainspace page to this work… and the existing systems for doing this don’t allow for the intermediate step of going from Wikisource to the edition, then to the work and then to Wikipedia.

So for now, my scruffy little script looks for project links at both levels, and seems to do so successfully.
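
The both-levels lookup amounts to something like this. It's a sketch only: the dictionary keys "sitelink" and "index" are invented here to stand for the Wikisource mainspace link and the Index page link, and the real script may well be shaped differently.

```python
def wikisource_links(edition: dict, work: dict) -> dict:
    """Prefer links recorded on the edition; fall back to those on the work."""
    return {
        # Mainspace page: strictly belongs on the edition, but often sits on the work.
        "mainspace": edition.get("sitelink") or work.get("sitelink"),
        # Transcription project (Index page), with the same fallback.
        "index": edition.get("index") or work.get("index"),
    }
```

An edition with no links of its own then simply inherits whatever has been recorded on its work, and both values come back as None when neither level has anything.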

The main problem now is that there’s just not much data about these books on Wikidata! I’ll get working on that next…

Reading, quietly

There is something exquisite in the act of sitting still in a comfortable place, reading, with a nice view. Writing sometimes, to note down the habits of passing sheep. And sometimes drinking, perhaps, once the sun has dropped to that point in the sky. But the important thing is the still small quiet that can be felt, occasionally. And the ability to see a bit of a distance. A good book helps, but isn’t really the point.

Penguin Classics on Wikisource

As a way of learning SPARQL and more about Wikidata, I’m trying to make a list of which pre-1924 Penguin Classics are on Wikisource.

Penguin lists their books at

The following Wikidata Query Service query lists all editions published by Penguin, their date of original publication, and whether there’s an Index page on Wikisource for the work or edition.

SELECT ?edition ?editionLabel ?work ?workLabel ?originalPublicationDate ?wikisourceIndexForWork ?wikisourceIndexForEdition
WHERE {
  ?edition wdt:P31 wd:Q3331189 .
  ?edition wdt:P577 ?publicationDate .
  ?edition wdt:P123 ?publisher .
  FILTER (
    ?publisher = wd:Q1336200 # Penguin Books Q1336200
    || ?publisher = wd:Q11281443 # Penguin Classics Q11281443
  )
  ?edition wdt:P629 ?work .
  OPTIONAL{ ?work wdt:P577 ?originalPublicationDate } .
  OPTIONAL{ ?work wdt:P1957 ?wikisourceIndexForWork } .
  OPTIONAL{ ?edition wdt:P1957 ?wikisourceIndexForEdition } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
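
A query like this can also be run from a script instead of the web interface. The following is a minimal sketch using only the Python standard library; the function names and the User-Agent string are placeholders, and the query text is passed in as a string.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def run_query(sparql: str) -> dict:
    """Send a SPARQL query to the Wikidata Query Service and return the JSON result."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": sparql, "format": "json"})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            # Placeholder: a real script should identify itself properly.
            "User-Agent": "penguin-classics-sketch/0.1 (example)",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def rows(result: dict) -> list[dict]:
    """Flatten SPARQL JSON bindings into plain {variable: value} dictionaries."""
    # Variables from OPTIONAL clauses that did not match are absent from a binding.
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]
```

With that, rows(run_query(query)) gives one dictionary per edition, and a missing "wikisourceIndexForEdition" key means no Index page is recorded for that edition.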

I’m not sure how often the WDQS data is updated, but so far it hasn’t been very useful for on-the-fly checking of recent updates. I’m sure there’s a better way of doing that though.

Where to work on ebooks? Wikisource vs GITenberg

Not enough photos are taken of the undersides of shop awnings.

This morning I’m at Parlapa, the lovely little caffe opposite the town hall. It’s a good place to be sat, with a slight hangover, with some nice small WordPress code to be working on, and of course with a coffee. The only downside is the fact that the City wifi almost reaches here, so I’ve got the most tantalising of faint signals and so keep trying to connect; I should give that up, and read a book.

I’m re-reading Tolstoy’s Dictaphone, which is a terrific book. But I’ve left it at home, un-terrifically, and so instead am reading Live and Let Live by Catharine Maria Sedgwick. Only read the first two pages so far so I’ve no idea what it’s about, and anyway keep getting distracted by typographical errors (so far, all resulting from the fact that Kobos don’t support small-caps. What a joke!).

Talking of small-caps, there’s movement at the GITenberg station, with a project underway to convert PG books to Unicode and to use proper punctuation characters (for quotation marks and dashes, at least). The idea is to use Asciidoc, but there is no standard way to express small-caps. In fact, none of the popular lightweight markup languages seem to have small-caps; what an oversight!

So if I had a more solid connection, I’d try to run the punctuation-fixing scripts against one of Mr Gissing’s works. Because there’s something nicer about working on books as stand-alone Git repositories, rather than in the mammoth universe of Wikisource and the WMF. A feeling that one is producing single editions, and perhaps a number of different formats for each, and is able to give each its due attention. The wikitext-as-source-format paradigm gets a bit tiring sometimes, because although the HTML output is great, and that makes for good ebooks (well, Kobo and its small-caps-ignorance aside), I’d really like to be able to produce printable (and thus bindable) output as well. Say, via LaTeX. And maybe Asciidoc is one way of doing that.

Really, the main thing that PG is missing (and GITenberg, although it’s probably easier to rectify there) is the ability to confer with the original source scans.

Publisher-provided metadata

Reading on an ereader, I seem to lose all of the “publisher’s metadata”: there is no longer any hint of what type of book this is — no cover to judge, no binding, no typography to tell if it’s a serious literary thing or a pulpy time-passer or an old forgotten once-loved.

It’s probably good this way. Lets the text speak for itself. Mainly the loss harms my ability to recall a book, more than the way I receive its words. No more recollection of 20th century authors as dusty orange Penguins with failing glue. Now they sit alongside every other of any time whose surname begins as theirs does, or is (as arbitrarily) co-alphabetically titled.

Perhaps what I’m looking for is a chronology of literature? Victorians vs. post-war makes more sense than the alphabet as a reading criterion!