Data Platform/Data Lake/Edits/Geoeditors
This is a private dataset that shows, in aggregate, what countries editors edit from. It is available only on the private cluster for now as we work out how to publish it for broader use. You can access this dataset on Superset or in the wmf.geoeditors_monthly
Hive table.
Public versions of this data:
- Analytics/Data_Lake/Edits/Geoeditors/Public
- AQS API: Editors by country
- meta:Differential_privacy/Completed/Geoeditors
For details on the infrastructure that produces this data source, see Analytics/Systems/Geoeditors.
Aggregated data
There are three datasets available with aggregated data:
wmf.geoeditors_monthly
counts editors by country and by wiki. Further, we publish two versions of this data: a yearly aggregate described below and a public bucketed version in the form of files that are updated every month.geoeditors_edits_monthly
counts edits by country and by wikiunique_editors_by_country_monthly
counts editors by country regardless of wiki.
Geoeditors Monthly
The wmf.geoeditors_monthly
table is defined as:
column | data type | description |
---|---|---|
wiki_db | string | The wiki database the editors worked in |
country_code | string | The 2-letter ISO country code this group of editors geolocated to, including Unknown (--) |
users_are_anonymous | boolean | Whether or not this group of editors edited anonymously |
activity_level | string | How many edits this group of editors performed
possible values: "1 to 4", "5 to 99", "100 or more" |
distinct_editors | bigint | Number of editors meeting this activity level |
namespace_zero_distinct_editors | bigint | Number of editors meeting this activity level, with only namespace zero edits |
month | string | [partition] The month in YYYY-MM format |
NOTE: while users are grouped by their user_name
[1], which is unique across all wikis, anonymous users are grouped by a hash of their User Agent and IP. This can still group multiple editors as one, but it should be rare enough that the numbers are indicative of anonymous editing patterns. Bots are identified as well as possible and filtered out, so this data does not include any bot activity we can identify.
From this table, you can find the number of editors, with different activity levels, anonymous or logged-in, in a specific project, editing from a specific country. An example query gets all this data for French Wikipedia in March 2018:
select
country_code,
users_are_anonymous,
activity_level,
sum(distinct_editors) as editor_count
from wmf.geoeditors_monthly
where month='2018-03' and wiki_db = 'frwiki'
group by country_code, users_are_anonymous, activity_level;
Geoeditors Edits Monthly
The geoeditors_edits_monthly
table is defined as:
column | data type | description |
---|---|---|
wiki_db | string | The wiki database the editors worked in |
country_code | string | The 2-letter ISO country code this group of edits geolocated to, including Unknown (--) |
edits_are_anonymous | boolean | Whether or not this group of edits were made anonymously |
edit_count | bigint | Number of edits |
namespace_zero_edit_count | bigint | Number of namespace zero edits |
month | string | [partition] The month in YYYY-MM format |
This is queried in similar fashion as geoeditor_edits
. It is used primarily to report statistics to the Global Innovation Index (GII).
Unique Editors By Country Monthly
The wmf.unique_editors_by_country_monthly
is similar to wmf.geoeditors_monthly
, however, the data is not aggregated by wiki, thus allowing per country editor counts regardless of any particular project. The table is defined as:
column | data type | description |
---|---|---|
country_code | string | The 2-letter ISO country code this group of editors geolocated to, including Unknown (--) |
users_are_anonymous | boolean | Whether or not this group of editors edited anonymously |
activity_level | string | How many edits this group of editors performed
possible values: "1 to 4", "5 to 99", "100 or more" |
distinct_editors | bigint | Number of editors meeting this activity level |
namespace_zero_distinct_editors | bigint | Number of editors meeting this activity level, with only namespace zero edits |
month | string | [partition] The month in YYYY-MM format |
It is used primarily to report metrics and dashboards by the Global Data and Insights Team.
Archived report: Geoeditors edits yearly
A yearly report is generated presenting how many edits have been made by country and by wiki over the year. The report is split in two files, one contains edits made on all wikipedia projects, the other contains edits made on wikidata project.
Public bucketed version
A monthly report similar to the Geoeditors monthly, but made for public use.
Access
The easiest way to get to this data is by using Superset, a UI that you can use to access most of our datasources. The example geoeditors dashboard is at https://superset.wikimedia.org/superset/dashboard/9. As an example, here's a map generated from the data:
Raw data
To compile the geoeditors_monthly
dataset, we import MediaWiki's cu_changes
table and filter out all the entries for administrative actions. This result is available as the wmf_raw.mediawiki_private_cu_changes
table. We then aggregate at the day and user level, labeling anonymous users and generate the wmf.editors_daily
table.
Since this raw data identifies the location of individual editors, we keep it for only 90 days, in accordance with our data retention guidelines. Data older than 90 days is continuously purged from the source cu_changes
table, but since we regenerate the Data Lake's editing data every month, we instead keep data in mediawiki_private_cu_changes
and editors_daily
for the two latest calendar months (the month of the latest mediawiki_history snapshot and the previous). Older data may be temporarily available before it is purged, but you should not rely on this.
Editors daily
This can be found in wmf.editors_daily
column | data type | description |
---|---|---|
wiki_db | string | The wiki database of origin |
country_code | string | The 2-letter ISO country code this group of edits geolocated to, including Unknown (--) |
user_fingerprint_or_name[1] | string | For registered users, it is the global username, unique across all wiki dbs. For anonymous users, it is a hash of IP + user agent. |
user_is_anonymous | boolean | Whether or not this user edited this group of edits anonymously |
date | string | The YYYY-MM-DD date for this group of edits. Note that date is a reserved word in SQL, so you will need to quote it using backticks in Hive or Spark (`date` ) or double quotes in Presto ("date" ).
|
edit_count | bigint | The total count of edits for this grouping |
namespace_zero_edit_count | bigint | The total count of edits to namespace zero for this grouping |
user_is_bot_by | array<string> | Whether this user is identified as a bot, values can be: group, name, both or empty |
action_type | int | The action type for this group of actions - see https://www.mediawiki.org/wiki/Manual:Recentchanges_table#rc_type |
month | string | [partition] The month in YYYY-MM format, all edits from that month from the cu_changes table are aggregated in this partition |
wmf_raw.mediawiki_private_cu_changes
cuc_id bigint 'Primary key, artificial', cuc_namespace bigint 'When pages are renamed, their RC entries do _not_ change.', cuc_title string 'When pages are renamed, their RC entries do _not_ change.', cuc_user bigint 'user_id, rev_user, the id of the user performing the action', cuc_user_text string 'user_text, rev_user_text, the name of the user at the time', cuc_actiontext string 'Unknown, undocumented', cuc_comment string 'The revision comment', cuc_minor boolean 'Whether this was a minor edit', cuc_page_id bigint 'The id of the page being edited', cuc_this_oldid bigint 'rev_id of the revision represented by this change', cuc_last_oldid bigint 'rev_id of the previous revision on this page', cuc_type int 'see https://www.mediawiki.org/wiki/Manual:Recentchanges_table#rc_type', cuc_timestamp string 'mediawiki-formatted timestamp of this event', cuc_ip string 'clear-text IP address of the user responsible for this event', cuc_agent string 'clear-text user agent of the user responsible for this event'
Changes and known problems since 2019-09
- See also m:Research:Page view#Change log for changes to the page view definition itself, as well as Analytics/Data_Lake/Traffic/Pageview_hourly#Changes and known problems since 2015-06-16
Date from | Date until | Task | record_version | Details |
---|---|---|---|---|
2018-01 | 2019-08 | T237072 | n/a | The count of namespace_zero_distinct_editors was calculated incorrectly for this period of time, for the "1 to 4" activity level. We replaced the invalid values with null values. All other activity levels and measures are fine for this period of time. |
Data loss 2017
- 2017: November 13th to November 20th, data not computed while the old database was decommissioned. Data will be lost if it's not re-computed by December 13th. As of this writing, we do not have capacity to save this data.
- 2017: October 25th to November 4th, data not computed while changing database hosts. Data will be lost if it's not re-computed by November 24th (as of this writing, a few days have already been lost)
Notes
- ↑ 1.0 1.1 Until August 2022, user_fingerprint_or_id, that is locally unique to a wiki_db was used.