
Data Platform/Systems/Airflow/Developer guide/Normalize a DAG

This guide explains how to take an existing DAG and update it to follow the latest best practices of our Airflow platform.

  • Create a new environment for running tests; see how to do that here
    • pytest runs the unit tests; if it runs cleanly, you have a working environment
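For example, once the environment from the linked guide is set up and activated, a clean test run from the root of your airflow-dags checkout looks like this:
cd airflow-dags   # your local checkout of the repository
pytest            # a run with no failures or errors confirms the environment is working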
Create your branch
Testing the DAG
  • Push your changes to a new branch
  • Log in to your preferred stat box
  • Clone the airflow-dags repository and check out your branch
  • Confirm you are working on your branch and can see your changes (an example command sequence is sketched after this list)
  • Configure the DAG for testing (see the sketch after this list)
    • Set the DagProperties or VarProperties so the DAG reads from production data but writes to test outputs
      • In some cases you may need to create test tables that mirror the production ones
      • If the job you are testing reads a lot of data (for example a full month), it may be appropriate to limit the input to speed up testing
    • Use /user/<<your-user>> as the base path for output folders in HDFS
  • From the airflow-dags folder, run:
kinit                                       # authenticate with Kerberos (needed for HDFS/Hive access)
./run_dev_instance.sh -p 8080 analytics     # start a development instance of the analytics Airflow instance, UI on port 8080
  • This takes a couple of minutes the first time; follow the printed instructions
    • Set up the SSH tunnel (see the example after this list)
    • Access the Airflow UI
  • Turn on the DAG in the UI; it should run automatically with the settings you configured above. You can check the "Rendered Template" of each task to see what property values it is using.
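For the clone-and-checkout steps above, a typical sequence on a stat box looks roughly like this (the branch name is a placeholder):
git clone https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags.git
cd airflow-dags
git checkout your-normalization-branch   # placeholder: the branch you pushed
git status                               # confirm you are on your branch and your changes are present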
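For the test configuration above, the HDFS side of the setup might look like the following sketch; every path and table name here is a placeholder, and the actual read/write locations are whatever you set in the DAG's DagProperties or VarProperties:
hdfs dfs -mkdir -p /user/<your-user>/airflow_test/my_dag   # personal output location
# if the DAG writes to a table, you can mirror the production schema into your own database, e.g.:
spark3-sql -e "CREATE TABLE <your_db>.my_table_test LIKE wmf.production_table"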
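The dev instance's output includes instructions for the SSH tunnel; a typical command looks something like this (hostname and port are examples, use whatever the script tells you):
ssh -N -L 8080:localhost:8080 stat1008.eqiad.wmnet   # forward local port 8080 to the dev instance
# then open http://localhost:8080 in your browser to reach the dev Airflow UI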
Pytest
  • Run pytest again
  • Go through the fixtures and verify they make sense
  • Then rebuild the fixtures by deleting the existing .expected files and re-running pytest with REBUILD_FIXTURES=yes:
find tests -name \*.expected -exec rm '{}' \;   # delete the existing fixtures
REBUILD_FIXTURES=yes pytest                     # regenerate them
Check Data
  • Compare the data produced by your test run with the official production version (a rough sketch follows)
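A rough way to do this, with every path and table name below being a placeholder, is to compare output sizes and row counts between your test run and production:
hdfs dfs -du -h /user/<your-user>/airflow_test/my_dag/output   # your test output
hdfs dfs -du -h /wmf/data/wmf/production_dataset               # the corresponding production output
# or compare row counts with a SQL client, e.g.:
spark3-sql -e "SELECT COUNT(*) FROM <your_db>.my_table_test"
spark3-sql -e "SELECT COUNT(*) FROM wmf.production_table"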