Self-hosted websites are doomed to die

I keep wanting to be able to recommend the ‘best’ way for people (who don’t like command lines) to get research stuff online. Is it Flickr, Zenodo, the Internet Archive, Wikimedia, and GitHub? Or is it a shared hosting account on Dreamhost, running MediaWiki, WordPress, and Piwigo? I’d rather the latter! Is it really that hard to set up your own website? (I don’t think so, but I probably can’t see what I can’t see.)

Anyway, even if you run your own website, you should still be putting stuff on Wikimedia projects. And even if you don’t use it for everything, Flickr is a good place for photos (in Australia, at least) because you can add them to the Australia in Pictures group and they’ll turn up in searches on Trove. The Internet Archive, even if it’s not the primary, cited home for research materials, is a great place to upload wikis’ public page dumps. So it really seems that the remaining trouble with self-hosted websites is that they’re fragile and subject to complete loss if you abandon them (i.e. stop paying the bills).

My current mitigation for my own sites’ reliance on me is to create annual dumps in multiple formats: uploading public stuff to IA, printing some things, and burning everything to Blu-ray discs that get stored in polypropylene sleeves, in the dark, in places where I can forget to throw them out. (Of course, I deal in tiny amounts of data, and no video.)
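For the wiki part of that, the dump-and-upload step is roughly MediaWiki’s standard XML export followed by an upload to the Internet Archive. A minimal sketch, assuming the internetarchive Python library is installed and configured, and with a hypothetical wiki path and item identifier:

    import subprocess

    from internetarchive import upload  # pip install internetarchive; run 'ia configure' first

    # Export the current revisions of all pages as a standard MediaWiki XML dump.
    with open("wiki-dump.xml", "w") as dump_file:
        subprocess.run(
            ["php", "maintenance/dumpBackup.php", "--current"],
            cwd="/var/www/mywiki",  # hypothetical wiki installation path
            stdout=dump_file,
            check=True,
        )

    # Upload the dump to a (hypothetical) Internet Archive item.
    upload(
        "mywiki-public-dump-2017",
        files=["wiki-dump.xml"],
        metadata={"title": "MyWiki public page dump, 2017", "mediatype": "texts"},
    )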

What was it Robert Graves said in I, Claudius about the best way to ensure a document’s survival being to just leave it sitting on one’s desk and not try to do anything special at all, because it’s all perfectly random as to what persists, and we cannot influence the universe in any meaningful way?

Extension:DocBookExport

A new extension called DocBookExport has recently been added to mediawiki.org. It provides a system for defining a book’s structure (a set of pages, plus a title and other metadata) and then pipes the pages’ HTML through Pandoc and out into DocBook format, from which it can be turned into PDF or just downloaded as-is.
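I haven’t looked closely at how the extension does the conversion internally, but the core of the pipeline is the Pandoc step: take the rendered HTML of each page in the book and turn it into DocBook. The command-line equivalent (a hypothetical standalone sketch, not the extension’s own code) is something like:

    import subprocess

    # Hypothetical rendered wiki pages (HTML saved to disk) that make up the book.
    chapters = ["Introduction.html", "Chapter_1.html", "Chapter_2.html"]

    # Convert each page's HTML to DocBook with Pandoc; the resulting XML can then
    # be processed by the usual DocBook toolchain to produce a PDF.
    for page in chapters:
        subprocess.run(
            ["pandoc", "--from=html", "--to=docbook",
             "--output=" + page.replace(".html", ".xml"), page],
            check=True,
        )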

There are a few issues with getting the extension to run (e.g. it wants to write to its own directory, rather than a normal place for temporary files), and I haven’t actually managed to get it fully functioning. But the idea is interesting. Certainly, there are some limitations with Pandoc, but mostly it’s remarkably good at converting things.

It seems that DocBookExport, and any other MediaWiki export or format conversion system, works best when the wiki pages (and their templates etc.) are written with the output formats in mind. Then, one can avoid things such as web-only formatting conventions that make PDF (or epub, or man page) generation trickier.

Wikisource books for binding

I have been experimenting with turning Wikisource works into LaTeX-formatted bindable PDFs. My initial idea was to produce quarto or octavo layout sheets (i.e. 8 or 16 book pages to a sheet of paper, printed on both sides, with the pages laid out so that when the sheet is folded they come out in the correct order), but now I’m thinking of just using a print-on-demand service (hopefully Pediapress, because they seem pretty brilliant).

Basically, my tool downloads all of a work’s pages and subpages (in the main namespace only; it doesn’t care how the work was constructed) and saves the HTML for these, in order, to an html/ directory. Then (and here’s the crux of the thing) it uses Pandoc to create a set of matching TeX files in an adjacent latex/ directory.
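In outline, that step looks something like the following Python sketch (a simplification with hypothetical names; my actual tool does more, and a plain prefix query doesn’t guarantee reading order):

    import os
    import subprocess

    import requests

    API = "https://en.wikisource.org/w/api.php"
    WORK = "The Nether World"  # root page of the work (hypothetical example)

    os.makedirs("html", exist_ok=True)
    os.makedirs("latex", exist_ok=True)

    # Find the work's main-namespace subpages by title prefix, plus the root page itself.
    result = requests.get(API, params={
        "action": "query", "list": "allpages", "apnamespace": 0,
        "apprefix": WORK + "/", "aplimit": "max", "format": "json",
    }).json()
    pages = [WORK] + [p["title"] for p in result["query"]["allpages"]]

    for i, title in enumerate(pages):
        # Fetch the rendered HTML of the page and save it.
        parsed = requests.get(API, params={
            "action": "parse", "page": title, "prop": "text", "format": "json",
        }).json()
        html_file = f"html/{i:03d}.html"
        with open(html_file, "w") as f:
            f.write(parsed["parse"]["text"]["*"])
        # Convert it to a matching TeX file with Pandoc.
        subprocess.run(
            ["pandoc", "--from=html", "--to=latex",
             f"--output=latex/{i:03d}.tex", html_file],
            check=True,
        )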

So far, so obvious. But the trouble with creating a separate source format for a work is that there are changes (formatting or structural) that one wants to make to the work but that can’t be made upstream on Wikisource, while still being able to bring down updates from Wikisource at any time. That is to say, this creates a fork of the work in a different format, but a fork that needs to be kept up to date.

My current solution is to save the HTML and LaTeX files in a Git repository (one per work) with two branches: one containing the raw, un-edited HTML and LaTeX, on which the download operation can be re-run at any time; and another, based on the first, where any required edits are made and into which the raw branch can be merged whenever it’s updated. This will sometimes result in merge conflicts, but for the most part (because the upstream changes are generally small typo fixes and the like) it will happen without error.
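The update cycle then amounts to a handful of Git commands, something like this sketch (the branch names here are hypothetical):

    import subprocess

    def git(*args):
        subprocess.run(["git", *args], check=True)

    # Refresh the raw branch with a clean re-download from Wikisource...
    git("checkout", "raw")
    # ... (re-run the download here, overwriting html/ and latex/) ...
    git("add", "--all")
    git("commit", "--message", "Update from Wikisource")

    # ...then merge it into the editing branch, where the local changes live.
    git("checkout", "edits")
    git("merge", "raw")  # usually clean; occasionally needs manual conflict resolution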

Now I just want to automate all this a little bit more, so a new project can be created (with GitHub repo and all) with a single (albeit slow!) command.

The output ends up something like The Nether World by George Gissing.pdf.

CFB Folder 1 done

The first folder of the C.F. Barker Archives’ material is done: scanning and initial entry into ArchivesWiki are finished. This is my attempt to use MediaWiki as a digital archive platform for physical records (and digitally-created ones, although they don’t feature as much in the physical folders). It’s reasonably satisfactory so far, although there’s lots that’s a bit frustrating. I’m attempting to document what I’m doing (in a Wikibook), and there’s more to figure out.

There are a few key parts to it, two of which stand out as a bit weird. The first is the structure of access control: a completely separate wiki is created for each required access group. This can make it tricky to link things together, but it makes for a much clearer separation of privacy, and it almost removes the possibility of things being inadvertently made public when they shouldn’t be. The second is that the File namespace is not used at all for file descriptions. Files are treated more like ‘attachments’, and their metadata lives on main-namespace pages, where the files are displayed. This means that files are not considered to be archival items in themselves (except, of course, when they are: digitally-created ones!), but just representations of them, so multiple file types or differently cropped photos can all appear on a single item’s record. The basic idea is to have a single page that encapsulates the entire item, whether the item is a single photograph or an aggregate of, for example, a whole box of photos being accessioned into ArchivesWiki.

Display Title extension

The MediaWiki Display Title extension is pretty cool. It uses a page’s display title in all links to that page. That might not sound like much, but it’s really useful to only have to change the title in one place and have it show correctly all over the wiki. (This is much the same as DokuWiki with the useheading configuration variable set to 1.)

This is the sort of extension that I really like: it does a small thing, but does it well, and it makes sense as an addition to the core software. It’s not trying to do something completely different and just sit on top of or inside MediaWiki. It’s also not something that everyone would want, and so does belong as an extension and not an addition to core (even though the display title feature is part of core).

The other thing the Display Title extension provides is a parser function for retrieving the display title of any page: {{#getdisplaytitle:A page name}}, so you can use the display title without creating a link.

On not hosting everything

I’ve been moving all my photos to Flickr lately. It’s been a long process, one complicated by the fact that it seems silly to run my own WordPress installation (and things like ArchivesWiki) if I’m not going to bother hosting everything myself. Of course, that’s not really very logical, and so I’ve decided that it’s perfectly okay to host photos on Flickr, videos on YouTube, and all the text (and miscellaneous) stuff here on my own server.

Importing books from IA to Wikisource

After being inspired by Ewan McAndrew at Wikimania, I’ve taken a crack at making my own video about importing books from the Internet Archive into Wikisource.

There are a few mistakes here, and as I’ve yet to figure out how to edit videos properly (I’ve only managed to hang my video editing software so far), they’ve stayed in; I’ll do another video correcting things.

The pagelist creation process is probably the hardest bit for beginners to Wikisource, and it’s something we need to work on. Metadata copying, on the other hand, mostly works fine (of course, we should not be copying the metadata, but that’s another story).

Publishing on the IndieWeb

I’ve been reading about POSSE and PESOS, and getting re-inspired about the value of a plurality of web tools. I sometimes try to focus on just one software package (MediaWiki, at the moment, because it’s what I code for at work). But I used to love working on WordPress, and I’ve got a couple of stalled projects for Piwigo lying around. Basically, all these things will be of higher quality if they have to work with each other and with all the data silos (Facebook, Twitter, etc.).

The foundational principles of the IndieWeb are:

  1. Own your data.
  2. Use visible data for humans first, machines second. See also DRY.
  3. Build tools for yourself, not for all of your friends. It’s extremely hard to fight Metcalfe’s law: you won’t be able to convince all your friends to join the independent web. But if you build something that satisfies your own needs, but is backwards compatible for people who haven’t joined in (say, by practicing POSSE), the time and effort you’ve spent building your own tools isn’t wasted just because others haven’t joined in yet.
  4. Eat your own dogfood. Whatever you build should be for yourself. If you aren’t depending on it, why should anybody else? We call that selfdogfooding. More importantly, build the indieweb around your needs. If you design tools for some hypothetical user, they may not actually exist; if you build tools for yourself, you actually do exist. selfdogfooding is also a form of “proof of work” to help focus on productive interactions.
  5. Document your stuff. You’ve built a place to speak your mind, use it to document your processes, ideas, designs and code. At least document it for your future self.
  6. Open source your stuff! You don’t have to, of course, but if you like the existence of the indie web, making your code open source means other people can get on the indie web quicker and easier.
  7. UX and design is more important than protocols, formats, data models, schema etc. We focus on UX first, and then as we figure that out we build/develop/subset the absolutely simplest, easiest, and most minimal protocols & formats sufficient to support that UX, and nothing more. AKA UX before plumbing.
  8. Build platform agnostic platforms. The more your code is modular and composed of pieces you can swap out, the less dependent you are on a particular device, UI, templating language, API, backend language, storage model, database, platform. The more your code is modular, the greater the chance that at least some of it can and will be re-used, improved, which you can then reincorporate.
  9. Longevity. Build for the long web. If human society is able to preserve ancient papyrus, Victorian photographs and dinosaur bones, we should be able to build web technology that doesn’t require us to destroy everything we’ve done every few years in the name of progress.
  10. Plurality. With IndieWebCamp we’ve specifically chosen to encourage and embrace a diversity of approaches & implementations. This background makes the IndieWeb stronger and more resilient than any one (often monoculture) approach.
  11. Have fun. Remember that GeoCities page you built back in the mid-90s? The one with the Java applets, garish green background and seventeen animated GIFs? It may have been ugly, badly coded and sucky, but it was fun, damnit. Keep the web weird and interesting.

Wikipedia workshop this weekend

There will be a workshop at the State Library of Western Australia this Saturday from 1 p.m., for anyone to come along and learn how to add just one citation to just one Wikipedia article (or more of either, of course). For more details, see meta.wikimedia.org/wiki/WikiClubWest.