Fremantle
· MediaWiki · websites · Apache ·
I've been doing a bit more today on locking parts of Freopedia against annoyingly persistent bots. This isn't a new topic, but these days it's getting silly with the increase in traffic from the bastard AI scrapers. A new page, handling web crawlers, has been set up to collect details about how to deal with them, but I thought I'd record my notes here for my own reference later.
I started by fixing up Apache's default log format, to give it some tab characters for easier cutting:
LogFormat "%h\t%t\t%v\t%U\t%q\t%>s\t\"%{Referer}i\"\t\"%{User-agent}i\"\tsec:%T" myformat
CustomLog ${APACHE_LOG_DIR}/access.log myformat
The log format has lots of variables available; the ones used here are as follows:
%h: Remote IP address.
%t: Date and time.
%v: ServerName (as defined in the relevant VirtualHost section).
%U: URL path (with a leading /).
%q: Query string (with a leading ?). (The server name, path, and query string are separated for easier analysis if needed.)
%>s: The final HTTP status code.
\"%{Referer}i\": The Referer header.
\"%{User-agent}i\": The User-agent header.
sec:%T: Number of seconds taken to serve the request (with a prefix to remind me of what it is).
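With tabs as delimiters, fields can be pulled straight out with cut. As a small sketch (the field number assumes the nine-field format above, where the User-agent is column 8, and the log path is the Debian default), here's a tally of the busiest user agents:

```shell
# Tally requests per user agent from the tab-separated access log.
# Column 8 is the quoted User-agent field in the format above.
LOG=/var/log/apache2/access.log
cut -f8 "$LOG" | sort | uniq -c | sort -rn | head
```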
Then I wanted to block non-logged-in users from accessing a couple of expensive special pages. This could be done with the new CrawlerProtection extension, but that's not yet compatible with the latest MediaWiki, so I'm doing it in Apache for now (which is probably more efficient anyway, although it does mean a worse-looking 403 error page):
<If "%{HTTP_COOKIE} =~ /[-a-zA-Z_]+UserID=/ || %{QUERY_STRING} !~ /(RecentChangesLinked|WhatLinksHere)/">
    Require all granted
</If>
<Else>
    Require all denied
</Else>
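A quick way to sanity-check the cookie pattern outside Apache (a sketch using grep's extended regexes, which behave the same as Apache's for this simple pattern; the cookie names here are made up):

```shell
pattern='[-a-zA-Z_]+UserID='
# A logged-in session cookie should match...
echo 'freopedia_wikiUserID=123; freopedia_wiki_session=abc' | grep -Eq "$pattern" && echo 'match'
# ...while anonymous cookies should not.
echo 'someOtherCookie=1' | grep -Eq "$pattern" || echo 'no match'
```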
This looks for a cookie named <wiki-ID>UserID, which is set when a user is logged in. The query-string check isn't very robust (it also blocks other requests whose query strings happen to contain those strings), but that's fine for now. I might try to improve it at some point (e.g. add history pages and diffs to the prohibition) and add it to the above page.
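To illustrate the lack of robustness (a sketch; the query strings are made up), any query string containing a special-page name trips the filter, even an innocent search for it:

```shell
pattern='(RecentChangesLinked|WhatLinksHere)'
# The intended target is blocked...
echo 'title=Special:WhatLinksHere/Fremantle' | grep -Eq "$pattern" && echo 'blocked'
# ...but so is a search query that merely mentions the page name.
echo 'search=WhatLinksHere&title=Special:Search' | grep -Eq "$pattern" && echo 'blocked'
```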
The Cargo drilldown page seems to be a bit of a magnet for scrapers, I guess because it has lots of links, and if you follow them you can execute all sorts of potentially slow queries. It can be locked down with a user right:
$wgGroupPermissions['*']['runcargoqueries'] = false;
$wgGroupPermissions['user']['runcargoqueries'] = true;
There's more to be done, but for now removing access to RecentChangesLinked and WhatLinksHere has had the biggest effect on my uptime, and I've postponed once again the day when I have to abandon MediaWiki in favour of a hard-to-edit but blisteringly fast static HTML site.
These changes seem to be improving things somewhat (note the slight decrease towards the end of 23 May).
Still to do is to prevent arbitrary sizes being created for image thumbnails.
