Image-suggestion/Runbook

Sometimes some of the data that the image suggestions, section topics, and/or SEAL pipelines rely on fails to generate.

Usually, the first you'll hear about a failure is an email to sd-alerts@lists.wikimedia.org with "SLA miss on DAG=image_suggestions_weekly" and/or "ImageSuggestionsTooLongSinceLastPush" in the subject line.

What to do when that happens

  • Log into https://airflow-platform-eng.wikimedia.org/
  • Look for any failed sensors in the DAGs. So far, the Hive tables whose partitions most often fail to generate have been:
    • wmf.mediawiki_wikitext_current (monthly)
    • wmf.wikidata_item_page_link (weekly)
    • wmf.wikidata_entity (weekly)
    • structured_data.commons_entity (weekly)
  • Post a message in #data-engineering-collab on Slack and see if anyone knows why the partition didn't generate, and if they can kick off generation
  • Post a message in #image-suggestions-and-sections on Slack to inform downstream that an image suggestions snapshot was skipped

We should be able to handle a week or two of missing data; things will just pick up from where they left off. However, you'll probably want to turn off the alert. To do that, log into a stat server and run image_suggestions/data_check.py manually.

If you do need to re-run a DAG for any reason, first pause the DAGs with the slider until everything is resolved. Once all the upstream data is ready, go to the grid view of the failed DAG you want to re-run and click on the little red box that indicates where something has failed. Add a note to explain what happened, then click "Clear" and it will run again.

Once the failed DAG has finished you can unpause the other DAGs and they'll run in their own time.

DAGs timeout

All DAGs are set to time out after 6 days, per the default configuration. Since they are all scheduled to run weekly on Thursdays, the timeout ensures there are no concurrent runs, since any hanging DAG stops on Wednesdays. The task that caused a DAG to time out is marked with the skipped state and colored purple in the Airflow web UI, while the DAG itself is marked with the failed state.
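
For reference, a DAG-level timeout is expressed roughly like this in an Airflow DAG definition. The snippet is only an illustrative sketch: the DAG id, schedule, and task are placeholders rather than the actual pipeline code; only dagrun_timeout reflects the behaviour described above.

  from datetime import timedelta

  import pendulum
  from airflow import DAG
  from airflow.operators.empty import EmptyOperator

  # Illustrative sketch: dag_id, schedule, and the task are placeholders.
  with DAG(
      dag_id="image_suggestions_weekly",  # placeholder id
      schedule_interval="0 0 * * 4",      # weekly on Thursdays
      start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
      catchup=False,
      # A hanging run is failed after 6 days, so it never overlaps the next weekly run.
      dagrun_timeout=timedelta(days=6),
  ) as dag:
      EmptyOperator(task_id="placeholder")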

Search indices

The ALIS DAG populates the analytics_platform_eng.image_suggestions_search_index_full Hive table with all the data relevant to image suggestions that should exist in the search indices. It also creates analytics_platform_eng.image_suggestions_search_index_delta, which is the difference between the latest set of image_suggestions_search_index_full data and the equivalent dataset from discovery.cirrus_index_without_content.

The search team has a DAG that picks up analytics_platform_eng.image_suggestions_search_index_delta and injects the data into the search indices. If a DAG fails, we write an empty partition, and the search team knows to treat that as a no-op.
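
Conceptually, the delta is the set of rows that should be in the search index but that the live index content doesn't yet reflect. Below is a rough sketch of one direction of that difference; the column names and join keys are assumptions, not the real ALIS job.

  # Illustrative sketch: schemas, join keys, and the write call are assumptions,
  # not the actual ALIS job.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  full = spark.table("analytics_platform_eng.image_suggestions_search_index_full")
  current = spark.table("discovery.cirrus_index_without_content")

  # Rows present in the desired index state but missing from the current index:
  # an anti-join of "what should be there" against "what is already there".
  delta = full.join(current, on=["wikiid", "page_id"], how="left_anti")

  delta.write.insertInto("analytics_platform_eng.image_suggestions_search_index_delta")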

Commons' threshold

If the number of rows in the Commons delta/diff is above a threshold specified in ALIS' DAG variables, it won't be shipped.

Make sure you override the value if needed (the threshold lives in the ALIS DAG's variables under Admin > Variables).
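
For orientation, here is a minimal sketch of how such a threshold check might look inside the DAG; the per-DAG variable key, the "commons_delta_threshold" name, and the default are assumptions, so check the ALIS DAG source for the real names.

  # Illustrative sketch: the variable key, the threshold key, and the default
  # value are assumptions, not the real ALIS configuration.
  from airflow.models import Variable

  def should_ship_commons_delta(row_count: int) -> bool:
      """Return False when the Commons delta exceeds the configured threshold."""
      props = Variable.get(
          "platform_eng/dags/alis_dag.py",  # hypothetical per-DAG variable key
          deserialize_json=True,
          default_var={},
      )
      threshold = int(props.get("commons_delta_threshold", 1_000_000))  # hypothetical
      return row_count <= threshold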

Cassandra TTL

The TTL for Cassandra data is 3 weeks, so if a pipeline has been failing for a while then the Cassandra data might just disappear.
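
The TTL is applied when the rows are written, so rewriting them refreshes the expiry; that is why the reset procedure below simply re-runs the task that pushes the data. A rough illustration with the Python Cassandra driver follows; the contact point, keyspace, table, and columns are placeholders, not the real schema.

  # Illustrative sketch: contact point, keyspace, table, and columns are placeholders.
  # The point is that rewriting a row resets its TTL clock.
  from cassandra.cluster import Cluster

  THREE_WEEKS = 21 * 24 * 60 * 60  # seconds

  cluster = Cluster(["localhost"])  # placeholder contact point
  session = cluster.connect()

  session.execute(
      "INSERT INTO image_suggestions.suggestions (wiki, page_id, image) "
      "VALUES (%s, %s, %s) USING TTL %s",
      ("enwiki", 12345, "Example.jpg", THREE_WEEKS),
  )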

You can reset the TTL as follows. Ensure that the cassandra DAG is paused before you start, and unpause it at the end.

  • Click on the DAG
  • Find the last successful run (green bar)
  • Click on the feed_suggestions square
  • Scroll down to check that the snapshot is correct
  • Add a note describing what you’re doing
  • Click on the green bar
  • Click Clear or SHIFT-c and confirm

ALIS and SLIS with different snapshots

If ALIS and SLIS have different snapshots, e.g. 2025-01-20 and 2024-12-23 respectively, the DAG must run twice. Instructions for ALIS:

  • Verify that the ALIS snapshot has zero SLIS
  • In the top tabs, go to Admin > Variables
  • Click on the Edit record icon to the left of the platform_eng/dags/cassandra_dag.py key
  • Fill in weekly_snapshot with the correct snapshot and save
  • Go back to the cassandra DAG
  • Click on the green bar
  • Click Clear or SHIFT-c and confirm
  • Click on the wait_for_SLIS square
  • Click Mark state as... > failed or SHIFT-f and confirm

Then, repeat the same procedure for SLIS, making sure the snapshot has zero ALIS and that you fail wait_for_ALIS instead.

Production deployment

If you release a new version of a pipeline and bump relevant target DAGs, the change will be automatically deployed.

Make sure the conda environment DAG variable ("conda_env") is wiped out, or Airflow won't pick up the new pipeline release.

Switch release

If you want to use a different pipeline release, override the "conda_env" value with the version you want to run and save.
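
For orientation, here is a rough sketch of how the per-DAG variable might be read to pick the release; the layout mirrors the weekly_snapshot example above, but the resolution logic and fallback below are assumptions rather than the real DAG code.

  # Illustrative sketch: the fallback logic and default are assumptions.
  from airflow.models import Variable

  DAG_VAR_KEY = "platform_eng/dags/cassandra_dag.py"  # per-DAG variable, as in the UI

  def resolve_conda_env(default_env: str) -> str:
      """Return the conda_env override if one is set, else the release baked into the DAG."""
      props = Variable.get(DAG_VAR_KEY, deserialize_json=True, default_var={})
      # An empty or absent "conda_env" means Airflow falls back to the release the
      # DAG shipped with; a non-empty value pins a specific pipeline release.
      return props.get("conda_env") or default_env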