MediaModeration
This section describes the configuration and operation of maintenance scripts associated with mw:Extension:MediaModeration and the MediaModeration 2.0 milestone.
Processing images manually, January 2024
As of January 2024, we are running extensions/MediaModeration/maintenance/scanFilesInScanTable.php manually on Wikimedia Commons. The invocation on mwmaint2002 is:
# Using `tmux`
mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep=30 --last-checked=20240312 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30-no-render-now.txt
Once we have completed processing the Wikimedia Commons backlog, we will shift to a new phase of the project, where we update the operations/puppet repo to process images on a daily basis.
If the script stops running, you can restart it by running the above command in a tmux or screen session on an active maintenance server.
Alerts
Per task T366165, an alert fires when the rate of OK requests drops below 3 per second. So far, this has happened when the script has crashed and needed to be restarted, rather than because of a general slowdown in processing throughput. Alerts are sent to the #tsp-engineering channel on Slack. Incoming alerts should be silenced on https://alerts.wikimedia.org. The alert is attached to this panel.
Overview
- Add items to scan table on upload
- Obtain thumbnails for files and send file contents to PhotoDNA
- Distribute scanning work by image (SHA-1) using the job queue
- Use sleep to manage rate limits and target 10M requests per month
- Always update the last_checked value. Update mms_is_match if PhotoDNA gives us a response
- Database
  - Uses an external store
  - Has three columns:
    - mms_sha1: can be a match with a SHA-1 in the filearchive, image, or oldimage tables
    - mms_last_checked: not a MW timestamp; instead uses a shorter format, e.g. 20240130, to track the day but not the time
    - mms_is_match: 1 if the SHA-1 matches, 0 if the SHA-1 was not a match, NULL if no successful scan has occurred yet
- For each SHA-1 value to be scanned, do these steps:
  - Iterate over all rows in the filearchive, image, and oldimage tables that have the given SHA-1:
    - Check if the image for this row can be scanned by PhotoDNA, otherwise continue to the next row.
    - Attempt to get a suitable thumbnail for the image, and if successful then attempt to get the contents of the thumbnail.
    - If the thumbnail or thumbnail contents cannot be generated, then try to get the image contents. If the image contents are not suitable then continue to the next row.
    - Send the image contents to PhotoDNA. If the request fails, then continue to the next row. If this is successful, then end the loop early.
  - Save the new match status returned by PhotoDNA (NULL is the match status if no row was successfully used to scan the SHA-1).
  - If the new match status is positive, send an email indicating a match.
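The per-SHA-1 steps above can be sketched in Python as follows. This is an illustrative sketch only, not the extension's code (which is PHP): `FileRow`, `scan_sha1`, `send_to_photodna`, and `last_checked_today` are hypothetical names, and a failed PhotoDNA request is modelled as a `None` return value. Saving the result and sending the match email are left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, Optional


@dataclass
class FileRow:
    """Hypothetical stand-in for a row from filearchive, image, or oldimage."""
    name: str
    scannable: bool              # can PhotoDNA handle this file at all?
    thumbnail: Optional[bytes]   # None if no thumbnail contents could be obtained
    contents: Optional[bytes]    # None if the original file is not suitable


def scan_sha1(
    rows: Iterable[FileRow],
    send_to_photodna: Callable[[bytes], Optional[bool]],
) -> Optional[bool]:
    """Return True/False for a match result, or None if no row could be scanned."""
    for row in rows:
        if not row.scannable:
            continue
        # Prefer the thumbnail; fall back to the original file contents.
        payload = row.thumbnail if row.thumbnail is not None else row.contents
        if payload is None:
            continue
        result = send_to_photodna(payload)  # None models a failed request
        if result is not None:
            return result  # the first successful scan ends the loop early
    return None


def last_checked_today(today: date) -> str:
    """Format a date as a day-only value in the style of mms_last_checked."""
    return today.strftime("%Y%m%d")


if __name__ == "__main__":
    demo_rows = [
        FileRow("a.svg", scannable=False, thumbnail=None, contents=None),
        FileRow("b.jpg", scannable=True, thumbnail=b"thumb", contents=b"full"),
    ]
    # A fake client that always reports "no match".
    print(scan_sha1(demo_rows, lambda payload: False))  # prints False
    print(last_checked_today(date(2024, 1, 30)))        # prints 20240130
```

Returning `None` when no row could be scanned mirrors the `NULL` match status described above, so a caller can store the result directly in `mms_is_match`.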
Metrics
Once a day, we emit the following metrics (MediaModerationMetricsFactory):
- the total number of rows in the mediamoderation_scan table for a given wiki
- the number of scanned images (mms_is_match IS NOT NULL) in the mediamoderation_scan table
- the number of unscanned images (mms_is_match IS NULL) in the mediamoderation_scan table
- the number of unscanned images (mms_is_match IS NULL) for which a scan was previously attempted (mms_last_checked IS NOT NULL) for a given wiki
The updateMetrics.php script emits these metrics for all wikis via the mediamoderation.pp puppet module (patch).
The metrics are visible on the MediaModeration PhotoDNA dashboard.
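As a rough illustration of how the four counts relate to each other, the sketch below builds a toy mediamoderation_scan table in an in-memory SQLite database and runs the conditions listed above. The column types, the sample rows, and the metric names in `QUERIES` are assumptions made for this example; they are not taken from the extension's schema or from MediaModerationMetricsFactory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed column types; only the column names come from this page.
conn.execute(
    "CREATE TABLE mediamoderation_scan ("
    " mms_sha1 TEXT PRIMARY KEY,"
    " mms_last_checked INTEGER,"
    " mms_is_match INTEGER)"
)
conn.executemany(
    "INSERT INTO mediamoderation_scan VALUES (?, ?, ?)",
    [
        ("sha1-a", 20240130, 0),     # scanned, no match
        ("sha1-b", 20240130, 1),     # scanned, match
        ("sha1-c", 20240131, None),  # scan attempted, but never successful
        ("sha1-d", None, None),      # never attempted
    ],
)

# Hypothetical metric names mapped to the conditions described above.
QUERIES = {
    "total": "SELECT COUNT(*) FROM mediamoderation_scan",
    "scanned": (
        "SELECT COUNT(*) FROM mediamoderation_scan"
        " WHERE mms_is_match IS NOT NULL"
    ),
    "unscanned": (
        "SELECT COUNT(*) FROM mediamoderation_scan"
        " WHERE mms_is_match IS NULL"
    ),
    "unscanned_with_last_checked": (
        "SELECT COUNT(*) FROM mediamoderation_scan"
        " WHERE mms_is_match IS NULL AND mms_last_checked IS NOT NULL"
    ),
}

metrics = {name: conn.execute(sql).fetchone()[0] for name, sql in QUERIES.items()}
print(metrics)
# {'total': 4, 'scanned': 2, 'unscanned': 2, 'unscanned_with_last_checked': 1}
```

In this toy data the scanned and unscanned counts sum to the total, and the last count isolates rows where a scan was attempted but did not produce a result.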
PhotoDNA
- Credentials are available in the Trust and Safety Product team's 1Password
- Rate limits as of January 2024:
  - 200 requests per second
  - 10 million requests per month
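To put the two limits in perspective, the following back-of-the-envelope calculation shows the average request rate that fits within the monthly quota, and how quickly the quota would be used up at the per-second ceiling. The function names are hypothetical, and the figures are plain arithmetic on the limits listed above, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MONTHLY_LIMIT = 10_000_000   # requests per month
PER_SECOND_LIMIT = 200       # requests per second


def sustained_rate(monthly_limit: int, days_in_month: int) -> float:
    """Average requests per second that exactly uses the monthly quota."""
    return monthly_limit / (days_in_month * SECONDS_PER_DAY)


def hours_to_exhaust(monthly_limit: int, requests_per_second: float) -> float:
    """Hours until the monthly quota is used up at a constant request rate."""
    return monthly_limit / requests_per_second / 3600


print(round(sustained_rate(MONTHLY_LIMIT, 30), 2))                  # 3.86
print(round(sustained_rate(MONTHLY_LIMIT, 31), 2))                  # 3.73
print(round(hours_to_exhaust(MONTHLY_LIMIT, PER_SECOND_LIMIT), 2))  # 13.89
```

In other words, the monthly quota is the binding constraint: running at the full 200 requests per second would use it up in under 14 hours, while a steady rate a little under 4 requests per second spreads it across the month.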