OCR on Wikisource

Fremantle

· Wikisource · OCR · Wikimedia · transcription ·

This weekend I've been trying to get back to sorting out some of the OCR tool's nomenclature around languages and text recognition models. It's the sort of job that's not too hard but touches lots of bits of code, and in this case two separate codebases, so any changes are easier to do piecemeal and must maintain backwards compatibility. When the first Wikisource OCR tools were built, they used Tesseract initially, and Google Cloud Vision after that, and both of those talk about 'languages' as one of the parameters to set when OCRing an image. Google goes as far as saying you must use BCP-47 identifiers.

This is what the on-wiki dialog looks like (with the new label).

But they're not really 'languages'. You can, for instance, tell Tesseract to use Cyrillic (i.e. a writing system used by quite a few languages). When we added Transkribus, which puts the idea of trained models front and centre, it became even clearer that we needed to do something to reduce the confusion around this.

After all, it does make sense to not think of OCR in terms of language: many languages are written with similar scripts, and OCR is all about shapes and patterns and the likelihood of certain blobs of ink being intended to be particular characters or lines of text. It doesn't care about grammar or meanings or syntax or morphology (although do note that I'm not a linguist, nor do I actually know anything about OCR or computer vision!).

Does "text recognition model" mean anything to Wikisource users though? I guess the term 'model' is pretty widespread at the moment (thanks to all this AI bollocks), so perhaps it's clear enough. And it will hopefully separate the ideas of a given Wikisource's content language from what OCR model should be picked for any given work (i.e. they're often the same, and we do set a default for each Wikisource, but a different model might work better for any particular scanned work).
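To make the default-versus-override idea concrete, here's a rough sketch in Python. It is not the tool's actual code: the `DEFAULT_MODELS` table, the `pick_models` function and the particular model identifiers are all made up for illustration. The only point is that each Wikisource gets a default, and an explicit choice for a given work wins over it.

```python
from typing import Optional, Sequence

# Hypothetical per-wiki defaults: Wikisource language code -> model identifiers.
# These entries are illustrative assumptions, not the tool's real configuration.
DEFAULT_MODELS: dict[str, list[str]] = {
    "en": ["eng"],
    "ru": ["rus"],
    "sr": ["script/Cyrillic"],
}


def pick_models(wiki_lang: str, requested: Optional[Sequence[str]] = None) -> list[str]:
    """Return the text recognition models to use for one OCR request.

    An explicit selection always wins; otherwise fall back to the default
    for the Wikisource's content language, or to an empty list if the
    wiki has no default configured.
    """
    if requested:
        return list(requested)
    return list(DEFAULT_MODELS.get(wiki_lang, []))


# Usage: the wiki's default, then an override for one particular scanned work.
print(pick_models("ru"))
print(pick_models("ru", ["script/Cyrillic"]))
```

Keeping the content language as just the key into the defaults table, rather than the thing passed to the OCR engine, is what keeps the two ideas separate.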
