Geolocation
Geolocation is based on the MaxMind GeoIP2 database paid for by the WMF, and is used in two ways:
- Varnish adds a cookie called GeoIP (only if the request does not already have one), with lifetime set to the current session, in the format <ISO 3166-1 country code>:<ISO 3166-2 region code>:<city name>:<lat>:<long>:<???>
- The analytics pipeline adds geolocation data to the geocoded_data field of the webrequest table, based on the IP address.
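As a rough illustration of the cookie format in the list above, the following Python sketch splits a GeoIP cookie value into named fields. The names parse_geoip_cookie and GeoIPCookie, the field names, and the sample value are hypothetical and chosen only for illustration. The meaning of the sixth field is not documented here, so it is kept as an opaque string, and the sketch assumes that no field contains a colon.

```python
from typing import NamedTuple, Optional


class GeoIPCookie(NamedTuple):
    """Fields of the GeoIP cookie, in the order given by the documented format."""
    country: str          # ISO 3166-1 country code
    region: str           # ISO 3166-2 region code
    city: str
    lat: Optional[float]
    lon: Optional[float]
    extra: str            # sixth field; its meaning is not documented here


def _to_float(text: str) -> Optional[float]:
    """Return text as a float, or None if it is empty or not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_geoip_cookie(value: str) -> GeoIPCookie:
    """Split a colon-separated cookie value (hypothetical helper)."""
    parts = value.split(":")
    # Pad short values so that missing trailing fields become empty strings.
    parts += [""] * (6 - len(parts))
    country, region, city, lat, lon, extra = parts[:6]
    return GeoIPCookie(country, region, city,
                       _to_float(lat), _to_float(lon), extra)


if __name__ == "__main__":
    # Made-up sample value, not taken from a real request.
    print(parse_geoip_cookie("US:CA:San_Francisco:37.77:-122.42:v4"))
```

Keeping the sixth field as a plain string avoids guessing at its semantics; callers that do not need it can ignore it.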
To look up data by hand, log in to mwlog1001 or mwmaint1002 and run
mmdblookup --file /usr/share/GeoIP/GeoIP2-City.mmdb --ip <IP>
(see MaxMind's site for documentation of the returned data structure). If you just want a single field, run something like
mmdblookup --file /usr/share/GeoIP/GeoIP2-City.mmdb --ip <IP> country names en
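If the manual lookup has to be repeated for many addresses, it could be wrapped in a small script. The following Python sketch only assembles and runs the mmdblookup command line shown above; the helper names build_mmdblookup_cmd and lookup are hypothetical, and the sketch assumes it runs on a host where mmdblookup and the database file are available. The sample address in the commented usage line is a documentation-range placeholder.

```python
import ipaddress
import subprocess
from typing import List

# Database path as given in the text above.
DB_PATH = "/usr/share/GeoIP/GeoIP2-City.mmdb"


def build_mmdblookup_cmd(ip: str, *path: str, db: str = DB_PATH) -> List[str]:
    """Return the argument list for an mmdblookup call (hypothetical helper).

    `path` is the optional lookup path, e.g. ("country", "names", "en").
    """
    # Raises ValueError early if `ip` is not a valid IPv4 or IPv6 address.
    ipaddress.ip_address(ip)
    return ["mmdblookup", "--file", db, "--ip", ip, *path]


def lookup(ip: str, *path: str) -> str:
    """Run mmdblookup and return its raw text output."""
    result = subprocess.run(
        build_mmdblookup_cmd(ip, *path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Example (only works where mmdblookup and the database are installed):
# print(lookup("198.51.100.7", "country", "names", "en"))
```

Passing the arguments as a list rather than a shell string means the IP address is never interpreted by a shell.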
History
Geolocation started as a Fundraising-Tech initiative introduced in 2009. The following links describe how its various incarnations are or were used:
- Analytics/Geolocation, published in 2014
- Slowness of GeoIP lookup and workarounds
- gerrit:260316 documents the cookie format
- Analytics/Geowiki documents one way to use the geolocation data
- meta:Geonotice (discussing alternatives to current system for regional-level geolocation)
- "ryan lane says that our geoip service is probably not fine grained enough": meta:IRC office hours/Office hours 2012-01-12
- Extension:CentralNotice may contain some information at some point in the future.
- meta:MaxMindCityTesting was a test of a geolocation DB data provider
- "Can 3rd parties use geoiplookup.wikimedia.org?" (for example, ULS on another MediaWiki installation). No known problems on the WMF end.[1]
- As of 2015, GeoIP information is set by Varnish in a cookie; see phabricator:diffusion/OPUP/history/production/templates/varnish/geoip.inc.vcl.erb for more information and gerrit:190964 for an overview of the system.
- As of 2016, the geoiplookup service was shut down.
Unknown country
You may encounter geo data where the country is "Unknown" and the country code is "--".
The primary (and perhaps only?) source of this is requests and edits made internally, such as bots running on Toolforge and other Wikimedia Cloud Services infrastructure.
They will have IP addresses starting with "10." – for example in the cu_changes table:
SELECT
  cuc_ip,
  cuc_agent,
  COUNT(1) AS n_changes
FROM cu_changes
WHERE cuc_ip RLIKE '^10\\.'
GROUP BY cuc_ip, cuc_agent
ORDER BY n_changes DESC;
One example is IP address "10.192.32.203" (with User-Agent ChangePropagation-JobQueue/WMF), which is indeed one of our servers (cf. codfw.wmnet). If we geolocate an internal address like that:
ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive-shaded.jar;
CREATE TEMPORARY FUNCTION get_geo_data as 'org.wikimedia.analytics.refinery.hive.GetGeoDataUDF';
SELECT get_geo_data('10.192.48.103') AS geo_data;
we get:
{
"city" : "Unknown",
"subdivision" : "Unknown",
"timezone" : "Unknown",
"country_code" : "--",
"country" : "Unknown",
"continent" : "Unknown"
}
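To recognise such internal requests outside SQL, the "10." prefix check can be approximated in Python. This is a minimal sketch that assumes "starting with 10." means the private 10.0.0.0/8 range; the names INTERNAL_NET and is_internal are hypothetical.

```python
import ipaddress

# Assumption: "IP addresses starting with 10." corresponds to 10.0.0.0/8.
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")


def is_internal(ip: str) -> bool:
    """Return True if `ip` falls in the internal 10.0.0.0/8 range."""
    # IPv6 addresses are never contained in an IPv4 network, so they
    # simply yield False here.
    return ipaddress.ip_address(ip) in INTERNAL_NET


if __name__ == "__main__":
    for addr in ("10.192.32.203", "10.192.48.103", "100.1.2.3"):
        print(addr, is_internal(addr))
```

Addresses for which is_internal returns True are the ones expected to geolocate to "Unknown" with country code "--", as in the output above.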