
Data Platform/Systems/Airflow/Kubernetes


This page covers the specifics of our Airflow instances deployed to Kubernetes. We assume that each Airflow instance is deployed alongside a dedicated CloudNativePG cluster running in the same namespace.

Creating a new instance

In the following section, we assume that we are creating a new Airflow instance named `airflow-test-k8s`, backed by a dedicated PG cluster named `postgresql-airflow-test-k8s`, both deployed in the `dse-k8s-eqiad` Kubernetes environment.
  • The first thing you need to do is create Kubernetes read and deploy user credentials
  • Add a namespace (using the same name as the airflow instance) entry into deployment_charts/helmfile.d/admin_ng/values/dse-k8s.yaml
    namespaces:
      # ...
      airflow-test-k8s:
        tlsExtraSANs:
          - airflow-test-k8s.wikimedia.org
    
  • Then, create the public and internal DNS records for this instance
  • Define the Airflow instance's helmfile.yaml file and associated values (take deployment_charts/helmfile.d/dse-k8s-services/airflow-test-k8s as an example)
  • Generate the S3 keypairs for both PG and Airflow
    brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=postgresql-airflow-test-k8s --display-name="postgresql-airflow-test-k8s"
    brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=airflow-test-k8s --display-name="airflow-test-k8s"
    # note: copy the `access_key` and `secret_key` from the JSON output, you will need them in the next step
    
  • Create the S3 buckets for both PG and Airflow
    brouberol@stat1008:~$ read access_key
    <PG S3 access key>
    brouberol@stat1008:~$ read secret_key
    <PG S3 secret key>
    brouberol@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --bucket-location=dpe --host-bucket=n mb s3://postgresql.airflow-test-k8s.dse-k8s-eqiad/
    brouberol@stat1008:~$ read access_key
    <Airflow S3 access key>
    brouberol@stat1008:~$ read secret_key
    <Airflow S3 secret key>
    brouberol@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --bucket-location=dpe --host-bucket=n mb s3://logs.airflow-test-k8s.dse-k8s-eqiad/
    
  • Register the service in our IDP server (into idp.yaml). Once the patch has been merged and Puppet has run on the IDP servers, copy the OIDC secret key generated for the Airflow service.
    root@idp1004:# cat /etc/cas/services/airflow_test_k8s-*.json  | jq -r .clientSecret
    <OIDC secret key>
    
  • Generate the secrets for both the PG cluster and the Airflow instance and add them to the private puppet repository, in /srv/git/private/hieradata/role/common/deployment_server/kubernetes.yaml (see the key-generation sketch after this list)
    dse-k8s:
        # ...
        postgresql-airflow-test-k8s:
          dse-k8s-eqiad:
            s3:
              accessKey: <PG S3 access key>
              secretKey: <PG S3 secret key>
    
        airflow-test-k8s:
          dse-k8s-eqiad:
            config:
              private:
                airflow__core__fernet_key: <random 64 characters>
                airflow__webserver__secret_key: <random 64 characters>
              airflow:
                aws_access_key_id: <Airflow S3 access key>
                aws_secret_access_key: <Airflow S3 secret key>
              oidc:
                client_secret: <OIDC secret key>
    
  • Deploy the service (which should deploy both the PG cluster and the airflow instance)
  • Once the instance is running, enable the ATS redirection from the wikimedia.org subdomain to the kube ingress. After Puppet has run on all the cache servers (wait a good 30 minutes), https://airflow-test-k8s.wikimedia.org should display the Airflow web UI, and you should be able to log in via CAS.
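
Regarding the secrets step above, here is one way to generate suitable values for the two Airflow keys (a sketch, runnable on any host with openssl and the python3 cryptography package; note that Airflow expects airflow__core__fernet_key to be a valid Fernet key, i.e. 32 urlsafe-base64-encoded random bytes, which is exactly what Fernet.generate_key() produces):

# Fernet key, used for airflow__core__fernet_key
$ python3 -c 'from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())'
# 64 random characters, used for airflow__webserver__secret_key
$ openssl rand -base64 48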

Configuring out-of-band backups

The PostgreSQL database cluster for this instance will already be configured with its own backup system that writes database backups and WAL archives to the S3 interface of the Ceph cluster.
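
For reference, CloudNativePG expresses this in-band backup configuration through the barmanObjectStore section of the Cluster resource. The sketch below is purely illustrative: the field names are CloudNativePG's and the bucket is the one created in the previous section, but the endpoint URL and the name and keys of the credentials Secret are assumptions, and the real values are rendered by the chart.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgresql-airflow-test-k8s
spec:
  backup:
    barmanObjectStore:
      destinationPath: s3://postgresql.airflow-test-k8s.dse-k8s-eqiad/
      endpointURL: https://rgw.eqiad.dpe.anycast.wmnet  # assumption
      s3Credentials:
        accessKeyId:
          name: postgresql-airflow-test-k8s-s3-credentials  # assumed Secret name
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: postgresql-airflow-test-k8s-s3-credentials  # assumed Secret name
          key: ACCESS_SECRET_KEY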

However, we decided to implement out-of-band backups of each of the S3 buckets containing these database backups, so we added a new backup pipeline to our database backup replica system, which is db1208.

When adding a new instance, the file to modify is in the private repo: hieradata/role/common/mariadb/misc/analytics/backup.yaml

Add your new bucket and its access credentials to the profile::ceph::backup::s3_local::sources hash structure, as shown.

profile::ceph::backup::s3_local::sources:
  postgresql.airflow-test-k8s.dse-k8s-eqiad:
    access_key: <PG S3 access key>
    secret_key: <PG S3 secret key>

When merged, this will update the file /srv/postgresql_backups/rclone.conf on db1208, adding the backups of this database cluster to the daily sync process and therefore to Bacula.
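
For illustration, the corresponding rclone remote in /srv/postgresql_backups/rclone.conf would look roughly like the following (a sketch; the section name and option set are determined by the puppet template, and the endpoint is an assumption based on the s3cmd commands above):

[postgresql.airflow-test-k8s.dse-k8s-eqiad]
type = s3
provider = Ceph
endpoint = https://rgw.eqiad.dpe.anycast.wmnet
access_key_id = <PG S3 access key>
secret_access_key = <PG S3 secret key>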

Upgrading Airflow

To upgrade Airflow, we first need to build a new Docker image that installs a more recent apache-airflow package version (example). Once the patch is merged, a publish:airflow job will be kicked off. Once it completes, go to the end of its logs and copy the full tag and digest of the newly published Airflow image (e.g. 2024-09-10-155931-16267fd457b14a196911d4100b94f26d0467d510@sha256:4de546f25b3901410f11b9e36e506a9c7e5b9bd750cf6b6b2e91b16afd75daad).

Now, deploy the new image to the airflow-test-k8s instance, by changing the app.version field in deployment_charts/helmfile.d/dse-k8s-services/airflow-test-k8s/values-production.yaml, and redeploy the test instance. Any outstanding DB migrations will automatically be applied. If everything goes well, bump the airflow version under deployment_charts/helmfile.d/dse-k8s-services/_airflow_common_/values-dse-k8s-eqiad.yaml, and redeploy every instance, one after the other.