
Image-suggestion/Runbook

From Wikitech

Sometimes some of the data that the image-suggestions pipeline (or related pipelines) relies on fails to generate.

Usually the first you'll hear about a failure is an email to sd-alerts@lists.wikimedia.org, with "ImageSuggestionsTooLongSinceLastPush" in the subject line.

What to do when that happens

Go into Hive and check if the partitions that are specified in the DAGs exist. So far the most common one that has failed to generate has been wmf.wikidata_item_page_link.
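The partition check can also be scripted. The following is a minimal sketch, assuming the hive command-line client is available on the host you are logged in to and that the table is partitioned by a column named snapshot; the function names and the example partition value are hypothetical, so take the real partition specs from the DAGs.

```python
import subprocess


def show_partitions(table: str) -> str:
    """Run SHOW PARTITIONS for `table` through the hive CLI and return its stdout."""
    result = subprocess.run(
        ["hive", "-S", "-e", f"SHOW PARTITIONS {table}"],
        capture_output=True,
        text=True,
        check=True,  # raise if the hive command itself fails
    )
    return result.stdout


def partition_exists(show_partitions_output: str, partition_spec: str) -> bool:
    """Return True if `partition_spec` is one of the lines printed by SHOW PARTITIONS."""
    lines = [line.strip() for line in show_partitions_output.splitlines()]
    return partition_spec in lines
```

For example, partition_exists(show_partitions("wmf.wikidata_item_page_link"), "snapshot=2024-01-01") would report whether that (hypothetical) snapshot has been generated.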

Go to #data-engineering-collab in Slack and ask whether anyone knows why the partition didn't generate, and whether they can kick off generation.

You probably don't need to do anything else - we should be able to handle a week or two of no data, and things will just pick up from where they left off.

However, you will probably want to turn off the alert. To do that, log in to a stat server and run image_suggestions/data_check.py manually.

If you do need to re-run a DAG for any reason, first pause the DAGs. Set up an ssh tunnel to the airflow server (ssh -t -N an-airflow1004.eqiad.wmnet -L 8600:127.0.0.1:8600), and then navigate to http://localhost:8600/. Use the little slider in the DAGs tab to pause any running DAGs until everything is resolved.
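If http://localhost:8600/ does not load, it can help to confirm that the tunnel is actually listening. This is a small sketch using only the Python standard library; the helper name port_is_open is hypothetical, and the port is the one from the ssh command above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

With the tunnel running, port_is_open("127.0.0.1", 8600) should return True; if it returns False, the ssh command has probably exited or is bound to a different local port.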

Next go into Hive and check if the partitions that are specified in the DAGs exist. So far the most common one that has failed to generate has been wmf.wikidata_item_page_link.

Once all the upstream data is ready, go to the grid view of the failed DAG you want to re-run and click on the little red box that indicates where something has failed. Add a note to explain what happened, then click "clear" and it'll run again.

Once the failed DAG has finished you can unpause the other DAGs and they'll run in their own time.

DAG timeouts

All DAGs are set to time out after 6 days (see, e.g., image suggestions). Since they are all scheduled to run weekly on Mondays, the timeout ensures there are no concurrent runs, as any hanging DAG stops on Sunday. The task that caused a DAG to time out is marked with the skipped state and colored purple in the Airflow web UI, while the DAG itself is marked with the failed state.
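The arithmetic behind the 6-day timeout can be sketched as follows. This is only an illustration, not the DAG code: the constant and function names are hypothetical, and the start date is an arbitrary Monday chosen for the example.

```python
from datetime import datetime, timedelta

# Values taken from the text above; the names are hypothetical.
DAG_TIMEOUT = timedelta(days=6)
SCHEDULE_INTERVAL = timedelta(weeks=1)


def timeout_deadline(run_start: datetime) -> datetime:
    """Latest moment at which a run started at `run_start` can still be active."""
    return run_start + DAG_TIMEOUT


def next_scheduled_run(run_start: datetime) -> datetime:
    """Start of the following weekly run."""
    return run_start + SCHEDULE_INTERVAL


run_start = datetime(2024, 1, 1)  # an arbitrary Monday
deadline = timeout_deadline(run_start)

# weekday() counts Monday as 0, so 6 means Sunday.
print(deadline.weekday())
# The hanging run is stopped before the next Monday run begins.
print(deadline < next_scheduled_run(run_start))
```

Because the timeout is one day shorter than the schedule interval, a hanging run always ends before the next one starts.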

Search indices

The image-suggestions DAG populates a Hive table analytics_platform_eng.image_suggestions_search_index_full with all the data relevant to image suggestions that should exist in the search indices. It also creates analytics_platform_eng.image_suggestions_search_index_delta, which is the difference between the latest set of image_suggestions_search_index_full data and the equivalent dataset from discovery.cirrus_index_without_content.
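As a toy illustration of what a "delta" between the two datasets means, the sketch below compares a desired state with a current state and keeps only the entries that differ. It is not the pipeline's actual logic: the row shape (a page id mapped to a set of tags), the sample values, and the function name are all assumptions made for the example.

```python
def compute_delta(
    full: dict[str, set[str]],
    current: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Return the entries of `full` whose value differs from what `current` holds.

    Pages missing from `current` are treated as having no tags.
    """
    return {
        page: tags
        for page, tags in full.items()
        if current.get(page, set()) != tags
    }


# Hypothetical sample data: what should be in the index vs. what is there now.
full = {"page_1": {"tag_a"}, "page_2": {"tag_b"}, "page_3": set()}
current = {"page_1": {"tag_a"}, "page_3": {"tag_c"}}

delta = compute_delta(full, current)
print(sorted(delta))
```

Only page_2 (new) and page_3 (changed) end up in the delta; page_1 is already up to date and is left out.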

The search team have a DAG that picks up analytics_platform_eng.image_suggestions_search_index_delta and injects the data into the search indices. If a DAG fails we write an empty partition, and the search team knows that's a no-op.

Cassandra TTL

The TTL for Cassandra data is 3 weeks, so if a pipeline has been failing for a while then the Cassandra data might just disappear. You can reset the TTL like this:

  • Find the most recent successful DAG run
  • Click on the first hive_to_Cassandra node
  • Look at the rendered template to check that the snapshot is correct
  • Add a note describing what you’re doing
  • Click “clear” and confirm. The node will then re-run, resetting the Cassandra TTL in the process.
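To judge how urgent a TTL reset is, you can work out when data from the last successful write will expire. This is a minimal sketch of the arithmetic, assuming the 3-week TTL stated above; the names and the example date are hypothetical. Cassandra expresses TTLs in seconds, which is why the helper converts.

```python
from datetime import datetime, timedelta

# TTL taken from the text above; the constant name is hypothetical.
CASSANDRA_TTL = timedelta(weeks=3)


def ttl_seconds() -> int:
    """The TTL expressed in seconds, the unit Cassandra uses."""
    return int(CASSANDRA_TTL.total_seconds())


def expiry(last_write: datetime) -> datetime:
    """Moment at which rows written at `last_write` expire."""
    return last_write + CASSANDRA_TTL


last_write = datetime(2024, 1, 1)  # hypothetical date of the last successful push
print(ttl_seconds())
print(expiry(last_write).date())
```

If the expiry date is close and the pipeline is still failing, re-running the hive_to_Cassandra node as described above rewrites the data and restarts the 3-week clock.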