Image-suggestion/Runbook
Sometimes some of the data that the image-suggestions pipeline (or related pipelines) relies on fails to generate.
Usually the first you'll hear about a failure is an email to sd-alerts@lists.wikimedia.org with "ImageSuggestionsTooLongSinceLastPush" in the subject line.
What to do when that happens
Go into Hive and check whether the partitions specified in the DAGs exist. So far, the one that has most often failed to generate is wmf.wikidata_item_page_link.
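As a rough illustration of that partition check, the sketch below builds a SHOW PARTITIONS statement and scans its output for an expected partition. The helper names, the `snapshot` partition column, and the example dates are assumptions made for illustration; check the DAG definitions for the real partition specs.

```python
def show_partitions_query(table: str) -> str:
    """Build the Hive statement that lists a table's partitions."""
    return f"SHOW PARTITIONS {table};"


def partition_exists(show_partitions_output: str, expected: dict[str, str]) -> bool:
    """Return True if some listed partition matches every expected key=value pair.

    Hive prints one partition per line, e.g. "snapshot=2024-01-08", with
    multi-level partitions joined by "/" (e.g. "year=2024/month=1").
    """
    for line in show_partitions_output.splitlines():
        line = line.strip()
        if not line:
            continue
        spec = dict(part.split("=", 1) for part in line.split("/") if "=" in part)
        if all(spec.get(key) == value for key, value in expected.items()):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical output pasted from a Hive session; the dates are made up.
    pasted = "snapshot=2024-01-01\nsnapshot=2024-01-08\n"
    print(show_partitions_query("wmf.wikidata_item_page_link"))
    print(partition_exists(pasted, {"snapshot": "2024-01-08"}))
```

Paste the output of the generated statement into `partition_exists` along with the partition the DAG expects. A `False` result means the upstream data is not there yet.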
Go to #data-engineering-collab in Slack and ask whether anyone knows why the partition didn't generate, and whether they can kick off generation.
You probably don't need to do anything else - we should be able to handle a week or two of no data, and things will just pick up from where they left off.
However, you will probably want to turn off the alert. To do that, log in to a stat server and run image_suggestions/data_check.py manually.
If you do need to re-run a DAG for any reason, first pause the DAGs. Set up an ssh tunnel to the airflow server (ssh -t -N an-airflow1004.eqiad.wmnet -L 8600:127.0.0.1:8600), and then navigate to http://localhost:8600/. Use the little slider in the DAGs tab to pause any running DAGs until everything is resolved.
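If the page at http://localhost:8600/ does not load, it can help to confirm that the tunnel is actually listening before debugging anything else. The following is a minimal sketch using only the Python standard library; the function name `tunnel_is_up` is hypothetical, and the port simply mirrors the one in the ssh command above.

```python
import socket


def tunnel_is_up(host: str = "127.0.0.1", port: int = 8600, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print("tunnel up" if tunnel_is_up() else "tunnel down: re-run the ssh command")
```

This only shows that the local end of the tunnel accepts connections. It does not verify that the Airflow web UI behind it is healthy.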
Next, go into Hive and check whether the partitions specified in the DAGs exist. So far, the one that has most often failed to generate is wmf.wikidata_item_page_link.
Once all the upstream data is ready, go to the grid view of the failed DAG you want to re-run, and click on the little red box that indicates where something has failed. Add a note to explain what happened, then click "clear" and it'll run again.
Once the failed DAG has finished, you can unpause the other DAGs and they'll run in their own time.
DAGs timeout
All DAGs are set to time out after 6 days, see e.g., image suggestions. Since they are all scheduled to run weekly on Mondays, the timeout ensures no concurrent runs, as any hanging DAG stops on Sunday. The task that caused a DAG to time out is marked with the skipped state and colored in purple in the Airflow Web UI, while the DAG itself is marked with the failed state.
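The arithmetic behind the "no concurrent runs" claim can be checked with a few lines of standard-library Python. This is a sketch that assumes a run starts at its scheduled Monday time; the constant and function names are illustrative and are not taken from the DAG code.

```python
from datetime import datetime, timedelta

DAG_TIMEOUT = timedelta(days=6)        # timeout stated in this runbook
SCHEDULE_INTERVAL = timedelta(weeks=1)  # weekly runs on Mondays


def timeout_deadline(run_start: datetime) -> datetime:
    """Latest moment a run can still be active before it times out."""
    return run_start + DAG_TIMEOUT


def overlaps_next_run(run_start: datetime) -> bool:
    """True if a hanging run could still be active when the next one starts."""
    return timeout_deadline(run_start) >= run_start + SCHEDULE_INTERVAL


if __name__ == "__main__":
    monday = datetime(2024, 1, 1)  # an example Monday
    print(timeout_deadline(monday).strftime("%A"))
    print(overlaps_next_run(monday))
```

Because the timeout is one day shorter than the schedule interval, a hanging run is always stopped a day before the next one is due.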
Search indices
The image-suggestions DAG populates a Hive table, analytics_platform_eng.image_suggestions_search_index_full, with all the data relevant to image suggestions that should exist in the search indices. It also creates analytics_platform_eng.image_suggestions_search_index_delta, which is the difference between the latest set of image_suggestions_search_index_full data and the equivalent dataset from discovery.cirrus_index_without_content.
The search team have a DAG that picks up analytics_platform_eng.image_suggestions_search_index_delta and injects the data into the search indices. If a DAG fails, we write an empty partition, and the search team knows that's a no-op.
Cassandra TTL
The TTL for Cassandra data is 3 weeks, so if a pipeline has been failing for a while then the Cassandra data might just disappear. You can reset the TTL like this:
- Find the most recent successful DAG run
- Click on the first hive_to_Cassandra node
- Look at the rendered template to check that the snapshot is correct
- Add a note describing what you’re doing
- Click “clear” and confirm
... and then the node will re-run, resetting the Cassandra TTL in the process.
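To judge how urgent a TTL reset is, you can work out when the data from the last successful push will expire. The sketch below assumes the 3-week TTL counts from the time of the last successful write; the names and example dates are illustrative.

```python
from datetime import datetime, timedelta

CASSANDRA_TTL = timedelta(weeks=3)  # TTL stated in this runbook


def expires_at(last_write: datetime) -> datetime:
    """Moment at which data written at `last_write` reaches its TTL."""
    return last_write + CASSANDRA_TTL


def is_expired(last_write: datetime, now: datetime) -> bool:
    """True once the TTL has elapsed since the last successful write."""
    return now >= expires_at(last_write)


if __name__ == "__main__":
    last_push = datetime(2024, 1, 1)  # hypothetical last successful push
    print(expires_at(last_push).date())
    print(is_expired(last_push, datetime(2024, 1, 15)))
```

If the expiry date is close and the upstream data is still missing, re-running the hive_to_Cassandra node as described above buys another three weeks.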