WMDE/Wikidata/ORES
ORES is a machine learning platform maintained by the WMF. We use ORES extensively on Wikidata and have developed the parts of it related to Wikidata, most notably the item quality model.
How to build item quality model
For models, you need to find the repo that holds the model and its features. For item quality, that's articlequality. Clone it.
After installing the Python requirements (the requirements.txt file), you can simply run "make wikidatawiki_models". The Makefile has more details on which commands will be run and what can cause issues. The Makefile also creates the model info file at model_info/wikidatawiki.item_quality.md and the model binary file at models/wikidatawiki.item_quality.gradient_boosting.model.
The model info file has critical information such as the ratio of false positives, thresholds, ROC-AUC, etc.
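A minimal sketch of the whole build, assuming the GitHub mirror as the remote and a plain virtualenv (adjust the remote and Python setup to your environment):

git clone https://github.com/wikimedia/articlequality.git
cd articlequality
pip install -r requirements.txt
make wikidatawiki_models
# inspect the results
less model_info/wikidatawiki.item_quality.md
ls -l models/wikidatawiki.item_quality.gradient_boosting.model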
To test the model file, you can run Python code like the following:
import mwapi
from revscoring import Model
from revscoring.extractors.api.extractor import Extractor

# Load the trained model binary
with open("models/wikidatawiki.item_quality.gradient_boosting.model") as f:
    scorer_model = Model.load(f)

# The extractor fetches feature values from the live wiki via the API
extractor = Extractor(mwapi.Session(host="https://www.wikidata.org",
                                    user_agent="test"))

# 123456789 is the revision ID to score
feature_values = list(extractor.extract(123456789, scorer_model.features))
print(scorer_model.score(feature_values))
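If everything is wired up correctly, score() should return a dict with the predicted class and per-class probabilities, roughly of the form {"prediction": "C", "probability": {"A": ..., "B": ..., ...}} (the exact shape depends on the model).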
Preparing the model for deployment
If you added features and you're sure they'll improve the system, then after getting the features reviewed and merged, you need to retrain the model. The binary files merged to the articlequality repo must be exactly the same as the ones in production (yeah, I know...). You need to find out which OS ores100x runs (currently stretch) and retrain the model on a VM with the same OS. Retraining, as said, is simply running "make wikidatawiki_models". If this command doesn't do anything, delete the model file (and datasets/wikidatawiki.labeling_revisions.w_cache*) first.
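A sketch of that dance, assuming ores1001.eqiad.wmnet as a representative production host (the host name here is illustrative; check the current ORES host list):

# on the ORES host: confirm the distro
ssh ores1001.eqiad.wmnet cat /etc/os-release
# on a VM running the same distro, inside the articlequality checkout:
rm -f models/wikidatawiki.item_quality.gradient_boosting.model
rm -f datasets/wikidatawiki.labeling_revisions.w_cache*
make wikidatawiki_models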
If you're adding more labeled data (and not adding features), you need to change the Makefile: look at the dependencies of the wikidata model in the Makefile, see where the datasets come from and how they are built, add another dataset, cat it into the final combined dataset, and then run "make wikidatawiki_models" to trigger a retrain.
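To illustrate, a hypothetical sketch of that Makefile change (the dataset names here are invented; mirror the pattern of the real wikidatawiki targets, and note the recipe line must be indented with a tab). Cat-ing works because the labeled datasets are JSON-lines files, one revision per line:

datasets/wikidatawiki.labeled_revisions.combined.json: \
		datasets/wikidatawiki.labeled_revisions.20k_2017.json \
		datasets/wikidatawiki.labeled_revisions.new_batch.json
	cat $^ > $@

The training target then depends on the combined file, so "make wikidatawiki_models" picks up the new dataset and retrains.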
For deployment, follow the general guideline: ORES/Deployment
Run the model on a dump
If you retrained the model and now want scores for revisions of Wikidata based on a dump, you can run the extract_scores utility in articlequality (there is no guarantee it will work; it's ancient, and you might have to tweak it). Here's the bash script that produces the dump monthly on stat1005:
month=$(date +"%Y%m")
day="${month}01"
# virtualenv with articlequality's requirements installed
source /home/ladsgroup/p3/bin/activate
cd /home/ladsgroup/articlequality
# score every page in this month's pages-articles dump
./utility extract_scores /mnt/data/xmldatadumps/public/wikidatawiki/${day}/wikidatawiki-${day}-pages-articles[1234567890]?*.xml-*.bz2 --model models/wikidatawiki.item_quality.gradient_boosting.model --sunset ${day}000000 --processes=10 --score-at monthly --verbose > run_${month}.out 2> run_${month}.err
# keep only lines containing the month (i.e. the score rows, via their timestamps)
grep ${month} run_${month}.out > run_${month}_1.out
mv run_${month}_1.out run_${month}.out
mv run_${month}.out wikidata_quality_snapshot_${month}.tsv
gzip -k wikidata_quality_snapshot_${month}.tsv
The resulting output is something like:
page_id   item_id    rev_id      timestamp       class  weighted_sum
15791782  Q14126127  1006231840  20191001000000  C      2.9696165555741985
21934496  Q20219489  1022984964  20191001000000  D      2.029816923665906
20434497  Q18881996  900205935   20191001000000  C      3.3918374898232493
25711961  Q23708018  982636318   20191001000000  C      2.91177394658644
23914320  Q21877494  1013198591  20191001000000  C      2.9855789615867576
14840084  Q13223924  881834471   20191001000000  C      3.0241936745417326
25414320  Q23405795  914178204   20191001000000  C      2.918603674849337
21934498  Q20219490  1015121079  20191001000000  D      2.029353280459076
23914321  Q21877495  1011135621  20191001000000  B      3.555012588220033
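As a quick sanity check on a snapshot, a small Python sketch that tallies the class distribution (the file name matches what the script above would produce for October 2019; columns are assumed tab-separated per the .tsv extension):

import csv
from collections import Counter

FIELDS = ["page_id", "item_id", "rev_id", "timestamp", "class", "weighted_sum"]
counts = Counter()
with open("wikidata_quality_snapshot_201910.tsv") as f:
    reader = csv.DictReader(f, fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        if row["page_id"] == "page_id":  # skip a header row if one survived the grep
            continue
        counts[row["class"]] += 1
print(counts.most_common())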