Data Platform/AQS/Wikistats 2/DataQuality/VettingPerProjectFamilies
Around the end of 2018 we added support for project families (all wikipedias, all wiktionaries, all wikivoyages...) to a number of contributor-related metrics in Wikistats 2. To verify that the numbers for these metrics are as close as possible to reality, we compared them to the canonical source of statistics for Wikipedias (the original version of Wikimedia Statistics, referred to here as Wikistats 1). This study is similar to the one conducted at the beginning of 2018 for individual wiki projects, and is likewise restricted to the Wikipedia project family.
A total of three metrics have been vetted: edits, average of article creations per day and total article count. One of the metrics, New Registered Users, is not available in Wikistats 1, and another one, Editors by Activity Level, is not available yet in Wikistats 2. Those two metrics have been excluded from this report.
Data comparisons show no significant differences from the findings in the per-project study, with a few caveats described below.
Metric | Summary |
---|---|
Edits | Taking Nostalgia Wikipedia into account, the numbers remain consistent, including the underreporting of edits in 2004 and 2005. |
Average of new articles per day | Data is almost the same in Wikistats 1 and Wikistats 2, except for the last two years, when Wikistats 1 did not report a number of wikis. |
Total article count | Corresponds to content-only pages to date in Wikistats 2. |
Analysis
Definition of project families
We consider a project family to be any of the large Wikimedia projects that exists in a separate version for each language. Sites such as Wikidata, MediaWiki, Meta-Wiki and Wikitech are therefore not considered project families. This is the complete list of project families that can be queried with Wikistats 2:
- Wikipedia
- Wikiquote
- Wikibooks
- Wiktionary
- Wikisource
- Wikiversity
- Wikivoyage
- Wikinews
Wikistats 1 only has data for the Wikipedia family, so this is the only family for which we can attempt to vet the data.
Obtaining Wikistats 2 data
Tabular data for each of the Wikistats 2 metrics was obtained through the Wikistats 2 UI, selecting the applicable breakdowns as described below (content-only new pages, content-only edits, etc.) and pressing the download button. These files were then imported into the Google Sheets worksheet linked at the bottom of this page.
The Nostalgia Wikipedia case
For all metrics explained below, the January 2001 - January 2002 period shows a consistent burst of over-reporting in Wikistats 2, even taking into account the variation described in the per-project vetting report. After some digging, we concluded that Nostalgia Wikipedia (nostalgiawiki) was never included in the Wikistats 1 metrics. Its content is just a snapshot of English Wikipedia as of December 2001, so all of its editing activity repeats that of enwiki in that period.
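One way to check this explanation is to subtract the nostalgiawiki counts from the Wikistats 2 family total before comparing it with Wikistats 1. The following is a minimal sketch of that adjustment, assuming monthly counts are held in plain dictionaries keyed by "YYYY-MM". The function name `exclude_wiki` and all sample numbers are hypothetical and are not taken from the study.

```python
def exclude_wiki(family_totals, wiki_counts):
    """Return family totals with one wiki's monthly counts subtracted.

    Both arguments map a month string ("YYYY-MM") to a count.
    Months missing from wiki_counts are left unchanged.
    """
    return {
        month: total - wiki_counts.get(month, 0)
        for month, total in family_totals.items()
    }


# Made-up numbers, for illustration only.
family = {"2001-12": 1500, "2002-01": 2100, "2002-02": 3000}
nostalgia = {"2001-12": 400, "2002-01": 100}

adjusted = exclude_wiki(family, nostalgia)
print(adjusted)
```

If the adjusted series falls back within the variation seen in the per-project report, that supports nostalgiawiki as the cause of the over-reporting.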
Edits
The numbers on the corresponding Wikistats 1 page correspond to content-only edits in Wikistats 2. Taking Nostalgia Wikipedia into account, the numbers remain consistent, including the underreporting of edits in 2004 and 2005.
Average of new articles per day
The numbers on the corresponding Wikistats 1 page correspond to content-only new pages in Wikistats 2.
Present-day variation
The comparison graph suggests that the closer the date is to the present, the more Wikistats 2 overestimates creations relative to Wikistats 1. This is because Wikistats 1, by the end of its lifetime, kept track of 278 Wikipedias, while the Data Lake contains data for 313.
Unreported wikis: aa, ady, azb, bat-smg, be-tarask, be-x-old, cbk-zam, cho, commons, din, dty, gag, gor, ho, hz, ii, inh, jam, kbp, kj, kr, lfn, lrc, map-bms, mh, mus, nds-nl, ng, nostalgia, olo, pfl, roa-tara, sat, shn, ten, test, test2, wg-en, xmf, zh-classical, zh-yue.
Full data for unreported wikis
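A list like the one above can be derived as a set difference between the wikis present in the Data Lake and those tracked by Wikistats 1. This is only a sketch under stated assumptions: the function name `unreported_wikis` is hypothetical, and the two input lists are tiny illustrative samples, not the real 313 and 278 entries.

```python
def unreported_wikis(data_lake_wikis, wikistats1_wikis):
    """Return, sorted, the wiki codes found only in the first collection."""
    return sorted(set(data_lake_wikis) - set(wikistats1_wikis))


# Hypothetical, heavily truncated inputs for illustration.
data_lake = ["en", "de", "azb", "zh-yue", "nostalgia"]
wikistats1 = ["en", "de"]

missing = unreported_wikis(data_lake, wikistats1)
print(missing)
```

Using sets also removes any duplicated codes from the result.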
Total article count
Variation on this metric is extremely high in the 2001-2003 period, but after that it remains consistently well below 1%. The metric in Wikistats 1 corresponds to content-only pages to date in Wikistats 2.
Calculations and additional info of interest
This study has been conducted using the January 2019 MediaWiki history snapshot. Unless stated otherwise, the date range of the report's data is January 2001 to December 2018, when the last computation batch was done in Wikistats 1. All variation numbers are expressed as the percentage variation between the numbers reported by Wikistats 2 and Wikistats 1.
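The percentage variation can be sketched as follows. The text does not state which value serves as the baseline, so this example assumes Wikistats 1 is the reference. The function name `variation_pct` and the sample values are hypothetical.

```python
def variation_pct(wikistats2_value, wikistats1_value):
    """Percentage variation of Wikistats 2 relative to Wikistats 1.

    Assumption: Wikistats 1 is the baseline (denominator).
    Returns None when the baseline is zero, since the ratio is undefined.
    """
    if wikistats1_value == 0:
        return None
    return (wikistats2_value - wikistats1_value) / wikistats1_value * 100


# Made-up monthly values, for illustration only.
print(variation_pct(1010, 1000))  # positive: Wikistats 2 reports more
print(variation_pct(990, 1000))   # negative: Wikistats 2 reports less
```

With this convention, a positive result means Wikistats 2 reports more than Wikistats 1, which matches the over-reporting direction described for the Nostalgia Wikipedia case.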
Google Drive worksheet for this study
Go to the main article on per project metrics editing data quality.