Self-hosted websites are doomed to die

I keep wanting to be able to recommend the ‘best’ way for people (who don’t like command lines) to get research stuff online. Is it Flickr, Zenodo, Internet Archive, Wikimedia, and Github? Or is it a shared hosting account on Dreamhost, running MediaWiki, WordPress, and Piwigo? I’d rather the latter! Is it really that hard to set up your own website? (I don’t think so, but I probably can’t see what I can’t see.)

Anyway, even if running your own website, one should still be putting stuff on Wikimedia projects. And even if not using it for everything, Flickr is a good place for photos (in Australia) because you can add them to the Australia in Pictures group and they’ll turn up in searches on Trove. The Internet Archive, even if not a primary and cited place for research materials, is a great place to upload wikis’ public page dumps. So it really seems that the remaining trouble with self-hosting websites is that they’re fragile and subject to complete loss if you abandon them (i.e. stop paying the bills).

My current mitigation to my own sites’ reliance on me is to create annual dumps in multiple formats, including uploading public stuff to IA, and printing some things, and burning all to Blu-ray discs that get stored in polypropylene sleeves in the dark in places I can forget to throw them out. (Of course, I deal in tiny amounts of data, and no video.)

What was it Robert Graves said in I, Claudius about the best way to ensure the survival of a document being to just leave it sitting on one's desk and not try at all to do anything special — because it's all perfectly random anyway as to what persists, and we cannot influence the universe in any meaningful way?

Trello backup script

Trello can be quite a useful task-tracker, but it’s got the usual pitfall of being a cloud service that might change at any time and become unusable (expensive or stupid or whatever).

Luckily, it’s a simple matter to set up a daily cronjob to run Matthieu Aubry’s trello-backup script. It produces a single JSON file for each board.

Now, restoring from that file might be a different matter…

Planet Freo back online

Sorry to everyone who noticed; Planet Freo‘s been offline for 48 hours. I thought I needed access to my home machine to fix it; turns out I didn’t, but anyway I waited till I was home (and powered by a dram of Ardmore) to fix it. It is now fixed.

I’ve updated the FreoWiki page that lists the feeds (if anyone’s keeping track of who’s been censured).

Anyone know of other blogs that should be on the list? Let me know!

forking wikis

I wish wikis were less collaborative! I wish they were more like software projects, where if one wants to modify anything, one gets one’s own copy and does anything at all to it.

No, I’m not really saying that there should be fewer centralised places of communal effort; these things are great… I just want a good way to branch and modify non-code content.

What I want is a cross between the Internet Archive’s system for uploading content into their collections, and Github’s user-centric arrangement.

The problem often seems to come back to the formats that things are in. It’s easy in the text-only code world, but wikis each have their own markup…

I wondered about the use of MediaWiki, and pulling in remote articles (periodic synchronisation), but of course there’s no merging in that idea, so it doesn’t work. It’s what Printable WeRelate does, but I’m yet to quite figure out how that’s going to deal with local additions to the data (probably, pages will be quite separate, with links only going from the local-only content to the remote-sync’d stuff; because we can’t modify the remote articles locally, and links in them when they’re elsewhere wouldn’t make sense).

So, there’s no solution: I’ll stick to centralised editing and storage, but carry on pulling backups (huzza to Wikiteam).

Backing up thunderbird email

When backing up Thunderbird, the only files I worry about are the actual mbox files that store the ‘Local Folders’ (archived) email, and the *.mab address book files. Everything else is operational cruft. This might seem a bit extreme—after all, why not back up the account configurations and user preferences etc.—but I jump from machine to machine often enough, and reinstall things so regularly, that setting up a few email accounts now and then is easy enough, and I prefer the minimalism. This way, I know exactly what I’m backing up and where my emails are, and I’m only backing up what’s essential. I’ve tested restoring to other email clients too (like Sylpheed), and all is seamless and heartening.

This minimal backup works too because I use the ‘archive’ function of Thunderbird, which is just a simple “copy to date-based (i.e. year-based) folder hierarchy in Local Folders” function, activated by pressing a. Hence, I don’t bother filing emails by topic, and I store all sent items in the same folders as those received (yeah, I’m not suggesting that anyone else is ever going to find my system at all sensible). So the files backed up are small in number and never disappear (new ones are added, is all). I don’t back up what’s still on the IMAP server, but then there’s not much of that at any given time.

The last part of my backup is to include all the files, both mbox and mab, in a version control system (in this case, Subversion). This way, I can roll back any file to any previous revision, easily. They’re all text, so the revision space-usage is efficient and of no worry; I’m only talking about half a gig per year anyway.

The script that does all this for me is simple:

#!/bin/bash

# Location of the Thunderbird profile (trailing slash included).
TB_PROFILE=/home/sam/.thunderbird/po2p7m5o.default/

# Back up into directories alongside this script.
MBOXEN="$(dirname "$0")/mboxen"
ABOOKS="$(dirname "$0")/abooks"

echo "Copying addressbooks to $ABOOKS..."
cp -v "$TB_PROFILE"*.mab "$ABOOKS/."

echo "Copying mail files to $MBOXEN..."
rsync -rv --exclude='*.msf' "${TB_PROFILE}Mail/Local Folders/Archives.sbd/" "$MBOXEN"

# ...Followed by a svn commit
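
The commit itself isn’t shown there; a minimal sketch of what that final step might look like (assuming the mboxen and abooks directories are already under version control) is:

cd "$(dirname "$0")"
svn add --force mboxen abooks    # pick up any newly-created mbox or .mab files
svn commit -m "Email backup $(date +%F)"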

mbox is a pretty ridiculous format, really. It’s based on the idea that it can determine the beginning of each email by the fact that the word ‘From’ starts a line and is followed by a space. That’s it! Daft. Thunderbird supposedly has some greater means of delimiting messages, but I’ve still had email corrupted on a number of occasions due to this silliness. Not hard to recover from, usually.
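
To illustrate, counting the messages in an mbox file is nothing more than counting those separator lines (the archive filename here is just an example):

grep -c '^From ' mboxen/2013    # each message starts with a line beginning "From "

Any genuine line in a message body that starts with “From ” has to be escaped (typically to “>From ”) to avoid a false split.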

Better sync’ing for Printable WeRelate

Printable WeRelate will now synchronize all ‘starting-points’ pages (i.e. any page with a <printablewerelate> element), rather than requiring a single page to be listed on the command line. This means that a cron job just needs to call sync.php at some interval (maybe nightly? of course, at some unusual number of minutes past the hour), and everything will be brought up to date.
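
Something like this in the crontab would do it (the path is hypothetical):

# Synchronize Printable WeRelate nightly, at an odd minute past the hour.
43 2 * * *  php /var/www/printable-werelate/sync.php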

This change is now available on the test site.

I’m now working on the display of the data within the wiki. This includes adding a notice (maybe just a template call, something like {{WeRelate page}}) to the effect that “this is a copy of a page from WeRelate.org and should not be edited; all changes should be made at [url]”, to send people back to WeRelate. I’ll probably also actually prevent the editing of the synchronized pages.

Setting up USB drives for backup

I use USB hard drives for backing up one of my machines, swapping them regularly but leaving everything else up to the backup script that runs daily. This means that I want to mount them at the same place every time, regardless of which drive I plug in or what device it is registered as. This isn’t very difficult because fstab can use UUIDs or labels to identify disks:

UUID=6B70-A309    /media/sw_backup vfat user 0 0
LABEL="SW_BACKUP" /media/sw_backup vfat user 0 0

(Note: these backup drives are formatted with FAT filesystems so that, if need be, I can restore on any system.)

To avoid having to manually add the disk every time I put a new one into rotation, I go with the label method.

To use this, each disk must be given the same label (and then not plugged in at the same time!). To set the label, first find the device:

sw@swbackup:~/backups$ sudo blkid
/dev/sda3: UUID="f31d1291-9d6f-441d-9f8d-fa34e9f569d5" TYPE="swap"
/dev/sda4: UUID="8a0b99a2-8a2e-4eae-7666-d607fbc44de5" TYPE="ext4"
/dev/sdb1: LABEL="NONAME" UUID="4A39-C8E7" TYPE="vfat"

Then sudoedit /etc/mtools.conf to add the following, where the device name is the same as above:

mtools_skip_check=1
drive s: file="/dev/sdb1"

Now mtools can change the label:

sw@swbackup:~/backups$ sudo mlabel -s s:
 Volume label is NONAME
sw@swbackup:~/backups$ sudo mlabel s:SW_BACKUP
sw@swbackup:~/backups$ sudo mlabel -s s:
 Volume label is SW_BACKUP
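
With the label set and the LABEL line in fstab, whichever disk happens to be plugged in gets mounted at the same point; thanks to the user option it doesn’t even need sudo:

sw@swbackup:~$ mount /media/sw_backup
sw@swbackup:~$ umount /media/sw_backup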

rsync.net

I signed up for an rsync.net account a bit over a month ago. They’re a reasonably-priced off-site filesystem provider, seemingly run by people who care about security and doing things normally. By ‘normally’, I mean rsync for starters (oddly enough, given their name) but also the whole gamut of *nix-y ways of doing things; one can interact with them with the usual tools. So they provide a proper, old-fashioned filesystem, and protect it well (there’s even a warrant canary). There’s a choice of datacentre — I chose the Zurich one — and plans ranging from 7GB (80c/GB/month) to 10TB (8c/GB/month). They even correspond via email, of all things! It really is odd that a company that behaves so normally is so uncommon… I don’t care about pretty graphics, boring and unused extra features, or ‘enterprise-readiness’ (whatever the flip that is), I just want a share of some disk in a big strong building somewhere, one that’s going to be protected and maintained properly and simply. All I can say so far is three cheers for rsync.net. (I’ll be sure to report back if my opinion changes.)
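
For the record, by ‘the usual tools’ I mean things roughly like the following (the hostname and paths are placeholders, not my actual account):

# Plain rsync over SSH to the remote filesystem:
rsync -av ~/Documents/ user@host.rsync.net:documents/

# And ordinary SSH and SFTP behave as you'd expect:
ssh user@host.rsync.net ls -lh
sftp user@host.rsync.net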

So that’s all well and good, and I’ve got my big disk in the sky, but how am I going to use it? I am going to host a Subversion repository there, to serve as an everything-bucket. That is all. How well will svn handle a huge (multi-gigabyte) repository? I’ve heard varying reports, but most seem to think it’ll be fine. Certainly its data-copying system will work well, as far as resuming aborted connections goes (it’ll only copy what’s not yet been copied; much as rsync.net does (although I don’t think it does it at any smaller unit than that of the whole file)). Questions remain about how much overhead disk space I’ll waste by doing this, but as most of the binary files will only be modified at most once or twice, and generally not at all once they’re checked in, I don’t think it’ll matter too much.
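
In sketch form, the everything-bucket would work something like this (the repository URL is made up, and it assumes the repository can be reached over svn+ssh):

svn checkout svn+ssh://user@host.rsync.net/data/svn/everything ~/everything
cd ~/everything
cp ~/scans/some-document.pdf .
svn add some-document.pdf
svn commit -m "Add scanned document"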

I’ll see how things go.