Image-suggestion/Runbook
Sometimes some of the data that the image suggestions, section topics, and/or SEAL pipelines rely on fails to generate.
Usually, the first you'll hear about a failure is an email to sd-alerts@lists.wikimedia.org with "SLA miss on DAG=image_suggestions_weekly" and/or "ImageSuggestionsTooLongSinceLastPush" in the subject line.
What to do when that happens
- Log into https://airflow-platform-eng.wikimedia.org/
- Look for any failed sensor in the DAGs (a quick PySpark partition check is sketched after this list). So far, the most faulty Hive tables have been:
  - wmf.mediawiki_wikitext_current (monthly)
  - wmf.wikidata_item_page_link (weekly)
  - wmf.wikidata_entity (weekly)
  - structured_data.commons_entity (weekly)
- Post a message in #data-engineering-collab on Slack and see if anyone knows why the partition didn't generate, and if they can kick off generation
- Post a message in #image-suggestions-and-sections on Slack to inform downstream that an image suggestions snapshot was skipped
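If you want to confirm from a stat host whether a partition actually landed, a minimal PySpark sketch is below; the exact partition naming varies per table, so eyeball the output rather than relying on a particular format.

```python
# Minimal sketch, run from a PySpark session on a stat host. It lists the
# most recent partitions of each upstream table so you can check whether
# the expected snapshot is there.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for table in [
    "wmf.mediawiki_wikitext_current",
    "wmf.wikidata_item_page_link",
    "wmf.wikidata_entity",
    "structured_data.commons_entity",
]:
    partitions = [row[0] for row in spark.sql(f"SHOW PARTITIONS {table}").collect()]
    print(table, "->", sorted(partitions)[-3:])
```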
We should be able to handle a week or two of no data, and things will just pick up from where they left off. However, you will probably want to turn off the alert: to do that, log into a stat server and run image_suggestions/data_check.py manually.
If you do need to re-run a DAG for any reason, first pause the DAGs with the slider until everything is resolved. Once all the upstream data is ready, go to the grid view of the failed DAG you want to re-run, and click on the little red box that indicates where something has failed. Add a note to explain what happened, then click "clear" and it'll run again.
Once the failed DAG has finished you can unpause the other DAGs and they'll run in their own time.
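The UI steps above are the documented path; for reference, the same pause/clear flow can also be driven through Airflow's stable REST API. This is only a hedged sketch: the credentials and the run id are placeholders, not values from this instance.

```python
# Hedged sketch of pause -> clear -> unpause via Airflow's stable REST API.
# Auth and the run id are placeholders; adapt them to this instance.
import requests

BASE = "https://airflow-platform-eng.wikimedia.org/api/v1"
AUTH = ("username", "password")  # placeholder credentials
DAG_ID = "image_suggestions_weekly"

# Pause the DAG (the API equivalent of the UI slider).
requests.patch(f"{BASE}/dags/{DAG_ID}", json={"is_paused": True}, auth=AUTH)

# ...wait until all the upstream data is ready, then clear only the failed
# task instances of the run you want to retry.
requests.post(
    f"{BASE}/dags/{DAG_ID}/clearTaskInstances",
    json={"dry_run": False, "only_failed": True, "dag_run_id": "<run id>"},
    auth=AUTH,
)

# Unpause once everything has gone green again.
requests.patch(f"{BASE}/dags/{DAG_ID}", json={"is_paused": False}, auth=AUTH)
```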
DAGs timeout
All DAGs are set to time out after 6 days, as per the default configuration. Since they are all scheduled to run weekly on Thursdays, the timeout ensures no concurrent runs: any hanging DAG is stopped by Wednesday. The task that caused a DAG to time out is marked with the skipped state and colored purple in the Airflow web UI, while the DAG itself is marked with the failed state.
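In DAG code this behaviour comes from Airflow's dagrun_timeout parameter. The sketch below is illustrative only; the dag id and cron expression are assumptions, not copied from the real DAG files.

```python
# Illustrative sketch of the timeout setup described above.
from datetime import datetime, timedelta
from airflow import DAG

with DAG(
    dag_id="image_suggestions_weekly",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 0 * * 4",     # weekly, on Thursdays
    dagrun_timeout=timedelta(days=6),  # a hanging run is stopped by Wednesday
    catchup=False,
) as dag:
    ...
```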
Search indices
The ALIS DAG populates the analytics_platform_eng.image_suggestions_search_index_full Hive table with all the data relevant to image suggestions that should exist in the search indices. It also creates analytics_platform_eng.image_suggestions_search_index_delta, which is the difference between the latest set of image_suggestions_search_index_full data and the equivalent dataset from discovery.cirrus_index_without_content.
The search team has a DAG that picks up analytics_platform_eng.image_suggestions_search_index_delta and injects the data into the search indices.
If a DAG fails, we write an empty partition, and the search team knows that's a no-op.
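Conceptually, the delta is the set of rows in the latest full dataset that the search indices don't have yet. The sketch below captures only that concept; the join keys are assumptions about the schema, and the real ALIS DAG's write mechanics may differ.

```python
# Hedged sketch of the delta computation; the join keys (wikiid, page_id)
# are assumed, not taken from the real table schemas.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

full = spark.table("analytics_platform_eng.image_suggestions_search_index_full")
indexed = spark.table("discovery.cirrus_index_without_content")

# Rows that should be in the search indices but aren't there yet.
delta = full.join(indexed, on=["wikiid", "page_id"], how="left_anti")
delta.write.insertInto("analytics_platform_eng.image_suggestions_search_index_delta")
```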
Commons' threshold
If the number of rows in Commons' delta/diff is above a threshold specified in the ALIS DAG's variables, it won't be shipped. Make sure you override the value if needed (a hedged sketch of the check itself follows the steps below):
- Log into https://airflow-platform-eng.wikimedia.org/
- In the top tabs, go to Admin > Variables
- Click on the Edit record icon on the left of the platform_eng/dags/alis_dag.py key
- Set the commons_delta_threshold value
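The guard itself presumably reads that per-DAG variable and compares it against the delta's row count. This is only a sketch of the idea, not how the real ALIS DAG structures the check:

```python
# Hedged sketch of the threshold guard described above.
from airflow.models import Variable

def commons_delta_within_threshold(delta_row_count: int) -> bool:
    """True if the Commons delta is small enough to ship."""
    props = Variable.get("platform_eng/dags/alis_dag.py", deserialize_json=True)
    return delta_row_count <= int(props["commons_delta_threshold"])
```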
Cassandra TTL
The TTL for Cassandra data is 3 weeks, so if a pipeline has been failing for a while then the Cassandra data might just disappear.
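The TTL is applied per write, which is why re-running the write task (below) refreshes it: every row is written again with a fresh 3-week clock. For context, here is a minimal sketch of what a TTL'd write looks like with the DataStax Python driver; the host, keyspace, table, and columns are hypothetical, not the real image-suggestions schema.

```python
# Minimal sketch of a TTL'd Cassandra write using the DataStax Python driver.
# All names here are hypothetical stand-ins.
from cassandra.cluster import Cluster

THREE_WEEKS = 3 * 7 * 24 * 3600  # 1,814,400 seconds

session = Cluster(["cassandra-host.example"]).connect("image_suggestions")
session.execute(
    "INSERT INTO suggestions (wiki, page_id, suggestion) "
    f"VALUES (%s, %s, %s) USING TTL {THREE_WEEKS}",
    ("enwiki", 12345, "Some_image.jpg"),
)
```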
You can reset the TTL as follows. Ensure that the cassandra DAG is paused before starting and unpaused at the end.
- Click on the DAG
- Find the last successful run (green bar)
- Click on the feed_suggestions square
- Scroll down to check that the snapshot is correct
- Add a note describing what you're doing
- Click on the green bar
- Click Clear or SHIFT-c and confirm
ALIS and SLIS with different snapshots
If ALIS and SLIS have a different snapshot, e.g., 2025-01-20 and 2024-12-23 respectively, then the DAG must run twice.
Instructions for ALIS:
- Verify that the ALIS snapshot has zero SLIS rows (see the sketch after these instructions)
- In the top tabs, go to Admin > Variables
- Click on the Edit record icon on the left of the platform_eng/dags/cassandra_dag.py key
- Fill weekly_snapshot with the correct snapshot and save
- Go back to the cassandra DAG
- Click on the green bar
- Click Clear or SHIFT-c and confirm
- Click on the wait_for_SLIS square
- Click Mark state as... > failed or SHIFT-f and confirm
Then, repeat the same procedure for SLIS, making sure the snapshot has zero ALIS rows and that you fail wait_for_ALIS instead.
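The "zero SLIS" / "zero ALIS" checks can be done from a PySpark session. In the sketch below, the table name and the origin/snapshot columns are hypothetical stand-ins for wherever the combined suggestions for a snapshot live; substitute the real ones.

```python
# Hedged sketch of the "zero SLIS rows for this snapshot" check.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

rows = (
    spark.table("analytics_platform_eng.image_suggestions")  # hypothetical
    .where(F.col("snapshot") == "2025-01-20")
    .where(F.col("origin") == "SLIS")                        # hypothetical
    .count()
)
assert rows == 0, f"expected zero SLIS rows, found {rows}"
```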
Production deployment
If you release a new version of a pipeline and bump relevant target DAGs, the change will be automatically deployed.
Make sure the conda environment DAG variable is wiped out, or Airflow won't pick up the new pipeline release:
- Log into https://airflow-platform-eng.wikimedia.org/
- In the top tabs, go to Admin > Variables
- For each relevant DAG, click on the Edit record icon on its left
- Delete the row starting with "conda_env" and save
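If you prefer to make that edit programmatically from the Airflow host, a hedged sketch is below. It assumes each DAG's properties live in a JSON-valued Airflow Variable keyed by the DAG file path, as the UI steps above suggest.

```python
# Hedged sketch of the same edit done with the Airflow Variable API.
from airflow.models import Variable

DAG_KEY = "platform_eng/dags/alis_dag.py"  # example key from this runbook

props = Variable.get(DAG_KEY, deserialize_json=True)
props.pop("conda_env", None)  # drop the pinned env so the new release is used
Variable.set(DAG_KEY, props, serialize_json=True)
```

The same sketch covers the release switch below: instead of popping the key, set props["conda_env"] to the version you want.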
Switch release
If you want to use a different pipeline release, override the "conda_env" value with the version you want to run and save.