The question of whether to use “curly quotes” on Wikisource has come up again.
I have been experimenting with turning Wikisource works into LaTeX-formatted bindable PDFs. My initial idea was to produce quarto or octavo layout sheets (i.e. 8 or 16 book pages to a sheet of paper that’s printed on both sides, with the pages laid out so that when the sheet is folded they come out in the correct order), but now I’m thinking of just using a print-on-demand service (hopefully Pediapress, because they seem pretty brilliant).
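For the folded-sheet idea, the simplest case (a single-fold, saddle-stitched booklet, four pages per sheet rather than a true quarto or octavo, which involve more folds) can be sketched in a few lines; this is my own illustration, not part of the tool:

```python
def booklet_sheets(n_pages):
    """Return the page order for a simple saddle-stitched booklet.

    Each sheet holds 4 pages: (front-left, front-right, back-left, back-right).
    n_pages must be a multiple of 4 (pad with blanks first if it isn't).
    """
    if n_pages % 4 != 0:
        raise ValueError("pad the page count to a multiple of 4 first")
    pages = list(range(1, n_pages + 1))
    sheets = []
    while pages:
        # The outermost sheet pairs the last page with the first, and so on inward.
        sheets.append((pages.pop(), pages.pop(0), pages.pop(0), pages.pop()))
    return sheets

# An 8-page booklet needs two sheets:
print(booklet_sheets(8))  # [(8, 1, 2, 7), (6, 3, 4, 5)]
```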
Basically, my tool downloads all of a work’s pages and subpages (in the main namespace only; it doesn’t care about how the work was constructed) and saves the HTML for these, in order, to an html/ directory. Then (here’s the crux of the thing) it uses Pandoc to create a set of matching TeX files in an adjacent latex/ directory.
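The conversion step could look something like this (a minimal sketch, assuming per-page .html files in a flat html/ directory; the flags and function names are my own illustration, not the tool’s actual code, and actually running it requires pandoc to be installed):

```python
import subprocess
from pathlib import Path

def pandoc_command(html_file, latex_dir):
    """Build the pandoc invocation to convert one downloaded HTML page to TeX.

    The output filename mirrors the input, following the html/ and latex/
    directory layout described above.
    """
    out = Path(latex_dir) / (Path(html_file).stem + ".tex")
    return ["pandoc", "--from=html", "--to=latex",
            str(html_file), "--output", str(out)]

def convert_all(html_dir="html", latex_dir="latex"):
    """Convert every downloaded page, in order, to a matching TeX file."""
    Path(latex_dir).mkdir(exist_ok=True)
    for page in sorted(Path(html_dir).glob("*.html")):
        subprocess.run(pandoc_command(page, latex_dir), check=True)
```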
So far, so obvious. But the trouble with this approach of wanting to create a separate source format for a work is that there are changes that one wants to make to the work (either formatting or structural) that can’t be made upstream on Wikisource — but we also want to be able to bring down updates at any time from Wikisource. That is to say, this is creating a fork of the work in a different format, but it’s a fork that needs to be able to be kept up to date.
My current solution is to save the HTML and LaTeX files in a Git repository (one per work) with two branches: one containing the raw, unedited HTML and LaTeX, on which the download operation can be re-run at any time; and a second, based off the first, where any required edits are made, and into which the first can be merged whenever it’s updated. This will sometimes produce merge conflicts, but for the most part (because the upstream changes are generally small typo fixes and the like) the merge will go through without error.
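The whole cycle can be walked through in a throwaway repository (a sketch assuming git is on the PATH; the branch names ‘upstream’ and ‘edits’ are my own labels, not necessarily the tool’s):

```python
import subprocess, tempfile
from pathlib import Path

def git(*args, cwd):
    subprocess.run(["git", *args], cwd=cwd, check=True, capture_output=True)

def demo_two_branch_flow():
    """Demonstrate the two-branch fork-with-updates workflow."""
    repo = Path(tempfile.mkdtemp())
    git("init", "-q", cwd=repo)
    git("config", "user.email", "demo@example.org", cwd=repo)
    git("config", "user.name", "Demo", cwd=repo)

    # 1. The raw download lands on the 'upstream' branch.
    git("checkout", "-qb", "upstream", cwd=repo)
    (repo / "ch1.tex").write_text("Chapter one.\n")
    git("add", "ch1.tex", cwd=repo)
    git("commit", "-qm", "Initial download from Wikisource", cwd=repo)

    # 2. Local formatting/structural edits happen on a branch based off it.
    git("checkout", "-qb", "edits", cwd=repo)
    (repo / "preamble.tex").write_text("% local layout tweaks\n")
    git("add", "preamble.tex", cwd=repo)
    git("commit", "-qm", "Local formatting edits", cwd=repo)

    # 3. A later re-download (e.g. a typo fix upstream) updates 'upstream'...
    git("checkout", "-q", "upstream", cwd=repo)
    (repo / "ch1.tex").write_text("Chapter one, with a typo fixed.\n")
    git("commit", "-aqm", "Refresh from Wikisource", cwd=repo)

    # 4. ...and is merged into 'edits', usually without conflict.
    git("checkout", "-q", "edits", cwd=repo)
    git("merge", "-q", "upstream", "-m", "Merge upstream refresh", cwd=repo)
    return repo
```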
Now I just want to automate all this a little bit more, so a new project can be created (with GitHub repo and all) with a single (albeit slow!) command.
The output ends up something like The Nether World by George Gissing.pdf.
I figured out how to do some slight video editing in OpenShot, and have tried to make another Wikisource tutorial:
There are a few mistakes here, and as I’ve yet to figure out how to edit videos properly (I’ve only managed to hang my video editing software so far), they’ve stayed in; I’ll do another video correcting things.
The pagelist creation process is probably the hardest bit for beginners to Wikisource, and it’s something we need to work on. Metadata copying, on the other hand, mostly works fine (of course, we should not be copying the metadata, but that’s another story).
(Firefox asked me to rate it this morning, with a little picture of a broken heart and five stars to select from. I gave it five (’cause it’s brilliant) and then it sent me to a survey on mozilla.com titled “Heavy User V2”, which sounds like the name of a confused interplanetary supply ship.)
Today WikiCite17 begins. Three days of talking and hacking about the galaxy that comprises Wikipedia, Wikidata, Wikisource, citations, and all bibliographic data. There are lots of different ways into this topic, and I’m focusing not on Wikipedia citations (which is the main drive of the conference, I think), but on getting (English) Wikisource metadata a tiny bit further along (e.g. figure out how to display work details on a Wikisource edition page); and on a little side project of adding a Wikidata-backed citation system to WordPress.
The former is currently stalled on me not understanding the details of P629 ‘edition or translation of’ — specifically whether it should be allowed to have multiple values.
The latter is rolling on quite well, and I’ve got it searching and displaying and the beginnings of updating ‘book’ records on Wikidata. Soon it shall be able to make lists of items, and insert the lists (or individual citations of items on them) into blog posts and pages. I’m not sure what the state of the art in PHP is for citation-formatting packages, but I’m hoping there’s something good out there.
And here is a scary chicken I saw yesterday at the Naturhistorisches Museum:
So I’ve added a feature to the ExternalArticles extension that allows a whole directory full of text files to be imported at once (namespaces are handled as subdirectories). More importantly, it also ‘watches’ the directories, and every time a file is updated (e.g. with Ctrl-S in a text editor or IDE) it is re-imported. This means I can have
MediaWiki:Gadget-Author.css open in PhpStorm, and just edit from there. I even have these files open inside a MediaWiki project and so autocompletion and documentation look-up works as usual for all the library code. It’s even quite a speedy set-up, luckily: I haven’t yet noticed having to wait at any time between saving some code, alt-tabbing to the browser, and hitting F5.
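The extension itself is PHP, but the watch-and-reimport idea is language-agnostic and can be sketched as simple mtime polling (function names here are my own, and `reimport` is hypothetical):

```python
import os
from pathlib import Path

def scan_mtimes(root):
    """Map each file under root to its last-modified time."""
    return {p: p.stat().st_mtime for p in Path(root).rglob("*") if p.is_file()}

def changed_since(root, previous):
    """Return files that are new or modified since the previous scan,
    plus the fresh snapshot to carry forward."""
    current = scan_mtimes(root)
    changed = [p for p, mtime in current.items() if previous.get(p) != mtime]
    return changed, current

# A watch loop would then look something like:
#   seen = scan_mtimes("pages/")
#   while True:
#       changed, seen = changed_since("pages/", seen)
#       for path in changed:
#           reimport(path)   # hypothetical: push the file back into the wiki
#       time.sleep(1)
```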
I dare say my bodged-together script has many flaws, but it’s working for me for now!
This is the first time I’ve done much work with the internal structure of DjVu files, and really it’s all been pretty straightforward. There were a couple of odd bits about matching element and page names up between things, but once that was sorted it all seems to be working as it should.
It’s a shame that the Internet Archive has discontinued their production of DjVu files, but I guess they’ve got their reasons, and it’s not like anyone’s ever heard of DjVu anyway. I don’t suppose anyone other than Wikisource was using those files. Thankfully they’re still producing the DjVu XML that we need to make our own DjVus, and it sounds like they’re going to continue doing so (because they use the XML to produce the text versions of items).
The notes from the Wikisource hangout last night are now on Meta.
I wonder how long it takes, after someone first starts editing a Wikimedia project, before they figure out that they can read lots of Wikimedia news on https://en.planet.wikimedia.org/ — and when, after that, they realise they can also post news there? (At which point they probably give up if they haven’t already got a blog.)
Anyway, I forgot that I can post news, but then I remembered. So:
There’s going to be a Wikisource meeting next weekend (28 January, on Google Hangouts), if you’re interested in joining:
I’ve made a start on a system to pull data from Wikidata and generate a portal for the Penguin Classics, with appropriate links for those that are on Wikisource or are ready to be transcribed.
I’m a bit of a SPARQL newbie, so perhaps this could’ve been done in a single query. However, I’m doing it in two stages: first gathering all the ‘works’ that have at least one edition published by Penguin Classics, and then finding all editions of each of those works and seeing if any of them are on Wikisource. Oh, and including the ones that aren’t, too!
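The first stage looks roughly like this (a sketch only: the Penguin Classics item ID is a placeholder to be looked up, while P629 ‘edition or translation of’ and P123 ‘publisher’ are the real properties):

```sparql
# Stage one: every work with at least one Penguin Classics edition.
SELECT DISTINCT ?work ?workLabel WHERE {
  VALUES ?penguin { wd:QXXXXX }    # placeholder: the Penguin Classics item
  ?edition wdt:P629 ?work ;        # edition or translation of
           wdt:P123 ?penguin .     # publisher
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```

The second stage then takes each ?work and asks for all its editions and their Wikisource sitelinks.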
Wikidata:WikiProject Books sort of uses the FRBR model to represent primarily books and editions (‘editions’ being a combination of the manifestation and expression levels of FRBR; i.e. an edition realises and embodies a work). So most of the metadata we want exists at the ‘work’ level: title, author, date of first publication, genre, etc.
At the ‘edition’ level we look for a link to Wikisource (because a main-namespace item on Wikisource is an edition… although this gets messy; see below), and a link to the edition’s transcription project. Actually, we also look for these on the work itself, because often Wikidata has these properties there instead or as well — which is wrong.
Strictly speaking, the work metadata shouldn’t have anything about where the work is on Wikisource (either mainspace or Index file). The problem with adhering to this, however, is that by doing so we break interwiki links from Wikisource to Wikipedia. A Wikipedia article is (almost always) about a work, and we want to link a top-level Wikisource mainspace page to this work… and the existing systems for doing this don’t allow for the intermediate step of going from Wikisource to the edition, then to the work, and then to Wikipedia.
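That intermediate step is expressible in SPARQL, at least (a sketch of the traversal, using the standard sitelink pattern):

```sparql
# From an English Wikisource page, via the edition item and P629,
# to the corresponding English Wikipedia article about the work.
SELECT ?wikisourcePage ?work ?wikipediaArticle WHERE {
  ?wikisourcePage schema:about ?edition ;
                  schema:isPartOf <https://en.wikisource.org/> .
  ?edition wdt:P629 ?work .        # edition or translation of
  ?wikipediaArticle schema:about ?work ;
                    schema:isPartOf <https://en.wikipedia.org/> .
}
```

It’s the on-wiki interwiki-link machinery, not the data model, that can’t make this hop.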
So for now, my scruffy little script looks for project links at both levels, and seems to do so successfully.
The main problem now is that there’s just not much data about these books on Wikidata! I’ll get working on that next…