This page details the operations that were run to migrate airflow to Kubernetes.
Migrating an existing instance
This section addresses how to perform a piecemeal migration of the airflow instances listed in Data Platform/Systems/Airflow/Instances to Kubernetes. The result of this migration is an airflow instance (webserver, scheduler and kerberos) running entirely in Kubernetes, alongside the database itself, without any data loss.
The migration is done in 4 steps:
- Migrate the webserver to Kubernetes
- Migrate the scheduler and kerberos components to Kubernetes
- Deploy a CloudnativePG cluster in the same Kubernetes namespace as Airflow and import the data
- Cleanup
At the time of writing, we have already migrated airflow-analytics-test, and we'll assume that this documentation covers the case of the airflow-search instance.
Migrating the webserver to Kubernetes
To deploy only the webserver to Kubernetes, we need to deploy airflow in a way that ensures it talks to an external database and opts out of the scheduler and kerberos components.
Prep work
- The first thing you need to do is create Kubernetes read and deploy user credentials
- Add a namespace entry (using the same name as the airflow instance) to deployment-charts/helmfile.d/admin_ng/values/dse-k8s.yaml
namespaces:
  # ...
  airflow-search:
    deployClusterRole: deploy-airflow
    tlsExtraSANs:
      - airflow-search.wikimedia.org
- Add the airflow-search namespace under the tenantNamespaces list in deployment-charts/helmfile.d/admin_ng/values/dse-k8s-eqiad/cephfs-csi-rbd-values.yaml, as well as in deployment-charts/helmfile.d/admin_ng/values/dse-k8s-eqiad/cephfs-csi-rbd-values.yaml
- Add the airflow-search namespace to the watchedNamespaces list defined in deployment-charts/helmfile.d/admin_ng/values/dse-k8s-eqiad/cloudnative-pg.yaml
- Deploy admin_ng (see the sketch after this list)
- Create the public and internal DNS records for this instance
- Create the airflow-search-ops LDAP group
- Register the service in our IDP server (in idp.yaml). After the patch is merged and puppet has run on the idp servers, copy the OIDC secret key generated for the airflow service.
root@idp1004:# cat /etc/cas/services/airflow_search-*.json | jq -r .clientSecret
<OIDC secret key>
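For reference, a minimal sketch of the "Deploy admin_ng" step above, run from the deployment server; the checkout path is the usual location of the deployment-charts repository, so adjust it if yours differs:
# on the deployment server
cd /srv/deployment-charts/helmfile.d/admin_ng
helmfile -e dse-k8s-eqiad -i apply  # -i prompts for confirmation before applying the diff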
Defining a secret key shared between the scheduler and the webserver
To have the webserver running on Kubernetes fetch task logs from the scheduler running on an Airflow host, they need to share the same secret key. This means that we need to commit this secret key in a location that will be taken into account by Puppet, as well as in another that will be taken into account by our Helm tooling.
First, generate a random string (64 characters long is good; a sketch for doing so follows the example below). Then commit that string in the two following locations, on the puppetserver:
# /srv/git/private/hieradata/role/common/analytics_cluster/airflow/search.yaml
# warn: adapt the file path for each airflow instance
profile::airflow::instances_secrets:
search:
...
secret_key: <secret key>
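A minimal sketch for generating such a secret key; any cryptographically random 64-character string will do:
openssl rand -hex 32  # prints a 64-character hexadecimal string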
Run puppet on the airflow host, and make sure each airflow service is restarted.
Keep the secret key handy; it will be used in the next section. Copy the db_password value as well, as you will need it for the next step.
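A minimal sketch of that step, assuming an-airflow1005 is the host backing the airflow-search instance (adjust the hostname and instance name to the instance you are migrating):
brouberol@an-airflow1005:~$ sudo run-puppet-agent
brouberol@an-airflow1005:~$ systemctl status 'airflow-*@search.service'  # each unit should report a recent start time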
Defining the webserver configuration
Add the following block to /srv/git/private/hieradata/role/common/deployment_server/kubernetes.yaml, on the puppetserver.
dse-k8s:
# ...
airflow-search:
dse-k8s-eqiad:
config:
private:
airflow__webserver__secret_key: <secret key>
airflow:
postgresqlPass: <PG password from the previous section>
oidc:
client_secret: <OIDC secret key>
kerberos:
keytab: |
<base64 representation of keytab>
Then, create a new airflow-search folder in deployment-charts/helmfile.d/dse-k8s-services (feel free to copy from deployment-charts/helmfile.d/dse-k8s-services/airflow-analytics-test). Your values-production.yaml file should look like this:
config:
airflow:
dags_folder: search
instance_name: search
dbHost: an-db1001.eqiad.wmnet
dbName: airflow_search
dbUser: airflow_search
auth:
role_mappings:
airflow-search-ops: [Op]
config:
logging:
remote_logging: false
oidc:
client_id: airflow_search
external_services:
webserver:
postgresql: [analytics]
airflow: [search]
ingress:
gatewayHosts:
default: "airflow-search"
extraFQDNs:
- airflow-search.wikimedia.org
kerberos:
enabled: false
scheduler:
remote_host: an-airflow-1xxx.eqiad.wmnet # use the appropriate hostname from https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Instances
enabled: false
postgresql:
cloudnative: false
Deploy airflow with helmfile.
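A minimal sketch of this deployment, run from the deployment server inside the new service directory (the path assumes the usual deployment-charts checkout location):
cd /srv/deployment-charts/helmfile.d/dse-k8s-services/airflow-search
helmfile -e dse-k8s-eqiad apply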
Set up the ATS redirection
Follow the example of that patch to set up the appropriate redirection, making the web UI visible to the public. Once merged, it should take about 30 minutes to fully take effect.
Migrate the scheduler and kerberos components to Kubernetes
- Create the kerberos principals and the base64 representation of the instance keytab
# Change `analytics-search` to the UNIX user the airflow tasks will impersonate by default in Hadoop
brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey analytics-search/airflow-search.discovery.wmnet@WIKIMEDIA
brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey airflow/airflow-search.discovery.wmnet@WIKIMEDIA
brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey HTTP/airflow-search.discovery.wmnet@WIKIMEDIA
brouberol@krb1001:~$ sudo kadmin.local ktadd -norandkey -k analytics-search.keytab \
analytics-search/airflow-search.discovery.wmnet \
airflow/airflow-search.discovery.wmnet@WIKIMEDIA \
HTTP/airflow-search.discovery.wmnet@WIKIMEDIA
- Copy the base64 representation of the generated keytab (see the sketch after this list)
- Create the S3 user
brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=airflow-search --display-name="airflow-search" # copy the access_key and secret_key
- Add the following values to the values block in /srv/git/private/hieradata/role/common/deployment_server/kubernetes.yaml, on the puppetserver.
dse-k8s:
  # ...
  airflow-search:
    dse-k8s-eqiad:
      config:
        private:
          airflow__webserver__secret_key: <secret key>
          airflow:
            postgresqlPass: <PG password>
            aws_access_key_id: <S3 access key>      # add this!
            aws_secret_access_key: <S3 secret key>  # add this!
          oidc:
            client_secret: <OIDC secret key>
          kerberos:
            keytab: |  # add this!
              <base64 representation of keytab>
- Create the S3 bucket
brouberol@stat1008:~$ read access_key
<S3 access_key>
brouberol@stat1008:~$ read secret_key
<S3 secret key>
brouberol@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --region=dpe --host-bucket=no mb s3://logs.airflow-search.dse-k8s-eqiad
- Sync all the scheduler and DAG task logs to S3
brouberol@an-airflow1005:~$ tmux
# in tmux
brouberol@an-airflow1005:~$ sudo apt-get install s3cmd
brouberol@an-airflow1005:~$ read access_key
<S3 access_key>
brouberol@an-airflow1005:~$ read secret_key
<S3 secret_key>
brouberol@an-airflow1005:~$ cd /srv/airflow-search/logs
brouberol@an-airflow1005:/srv/airflow-search/logs$ s3cmd \
    --access_key=$access_key \
    --secret_key=$secret_key \
    --host=rgw.eqiad.dpe.anycast.wmnet \
    --region=dpe \
    --host-bucket=no \
    sync -r ./* s3://logs.airflow-search.dse-k8s-eqiad/
...
# this will take a long time. Feel free to detach the tmux session
- Once the logs are synchronized, stop all the airflow systemd services and sync the logs again, to account for the DAGs that might have run during the first sync. Make an announcement on Slack and IRC, as this will prevent any DAG from running for a while.
brouberol@an-airflow1005:~$ sudo puppet agent --disable "airflow scheduler migration to Kubernetes"
brouberol@an-airflow1005:~$ sudo systemctl stop airflow-{webserver,kerberos,scheduler}@*.service
brouberol@an-airflow1005:~$ cd /srv/airflow-search/logs
brouberol@an-airflow1005:/srv/airflow-search/logs$ s3cmd \
    --access_key=$access_key \
    --secret_key=$secret_key \
    --host=rgw.eqiad.dpe.anycast.wmnet \
    --region=dpe \
    --host-bucket=no \
    sync -r ./* s3://logs.airflow-search.dse-k8s-eqiad/
...
# this time, this should be fairly short
- Deploy the following helmfile.d/dse-k8s-services/airflow-search/values-production.yaml configuration:
config:
  airflow:
    dags_folder: search
    instance_name: search
    dbHost: an-db1001.eqiad.wmnet
    dbName: airflow_search
    dbUser: airflow_search
  auth:
    role_mappings:
      airflow-search-ops: [Op]
  config:
    core:
      executor: KubernetesExecutor
  kerberos:
    principal: analytics-search/airflow-search.discovery.wmnet
  oidc:
    client_id: airflow_search
external_services:
  webserver:
    postgresql: [analytics]
ingress:
  gatewayHosts:
    default: "airflow-search"
  extraFQDNs:
    - airflow-search.wikimedia.org
postgresql:
  cloudnative: false
- Deploy with helmfile. You should see the airflow-kerberos and airflow-scheduler pods appear. Once the deployment goes through, connect to https://airflow-search.wikimedia.org and execute DAGs to make sure they run correctly. If they do not, the fun begins: there is no real playbook here, as there is no way to tell in advance what will fail. A network policy might be missing, or a patch might need to be submitted to airflow-dags. Roll the dice!
- Once everything is working, submit a puppet patch that adds services_ensure: absent under each instance in the profile::airflow::instances hieradata associated with the airflow instance.
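As mentioned in the keytab step above, a minimal sketch for producing the base64 representation of the keytab; run it on the host where the keytab file was written, and note that the filename simply matches the ktadd example:
base64 -w0 analytics-search.keytab  # single-line base64 output, suitable for the hieradata value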
Deploy a CloudnativePG cluster in the same Kubernetes namespace as Airflow and import the data
- Create the postgresql-airflow-search S3 user. Copy the access and secret keys from the output.
brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=postgresql-airflow-search --display-name="postgresql-airflow-search"
- Create the S3 bucket in which the PG data will be stored.
brouberol@stat1008:~$ read access_key
REDACTED
brouberol@stat1008:~$ read secret_key
REDACTED
brouberol@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --region=dpe --host-bucket=no mb s3://postgresql-airflow-search.dse-k8s-eqiad
Bucket 's3://postgresql-airflow-search.dse-k8s-eqiad/' created
- Add the S3 keys to the private secret repository, in hieradata/role/common/deployment_server/kubernetes.yaml
...
postgresql-airflow-search:
  dse-k8s-eqiad:
    s3:
      accessKey: <S3 access key>
      secretKey: <S3 secret key>
    cluster:
      initdb:
        import:
          password: <PG password>
...
- Add the airflow-search namespace to the list of cloudnative-pg tenant namespaces, in helmfile.d/admin_ng/values/dse-k8s-eqiad/cloudnative-pg-values.yaml. Deploy admin_ng.
- Uncomment the sections related to postgresql/cloudnative-pg from the airflow app helmfile, and add a values-postgresql-airflow-search.yaml file containing the following data:
cluster:
  initdb:
    import:
      host: an-db1001.eqiad.wmnet
      user: airflow_search
      dbname: airflow_search
external_services:
  postgresql: [analytics]
- Before deploying, scale the airflow webserver and scheduler deployments down to 0 replicas. This will cause downtime for Airflow, so make sure to reach out to the team beforehand.
- For airflow-analytics only, perform a backup of the airflow_analytics DB
brouberol@an-db1001:~$ sudo -i -u postgres
postgres@an-db1001:~$ pg_dump airflow_analytics > "airflow_analytics_$(date -I).sql"
- For airflow-analytics only, drop any log and job data older than 365 days (see https://phabricator.wikimedia.org/T380614) by ssh-ing to an-client1002.eqiad.wmnet and running
brouberol@an-launcher1002:~$ sudo -u analytics /srv/airflow-analytics/bin/airflow-analytics db clean --tables log,job --clean-before-timestamp "$(date -d "$(date '+%Y-%m-%d') - 365 days" '+%Y-%m-%d %H:%M:%S')"
- Deploy by running helmfile -e dse-k8s-eqiad --selector 'name=postgresql-airflow-search' apply, to only deploy the cloudnative PG pods.
- Remove all PG-related secrets from the private puppet repository.
...
postgresql-airflow-search:
  dse-k8s-eqiad:
    s3:
      accessKey: REDACTED
      secretKey: REDACTED
-   cluster:
-     initdb:
-       import:
-         password: REDACTED
airflow-search:
  dse-k8s-eqiad:
    config:
      private:
        airflow__webserver__secret_key: REDACTED
        airflow:
-         postgresqlPass: REDACTED
          aws_access_key_id: REDACTED
          aws_secret_access_key: REDACTED
...
- Once all pods are healthy (see the sketch after this list), empty the values-postgresql-airflow-search.yaml file of all values, leaving it completely empty. Redeploy with helmfile -e dse-k8s-eqiad --selector 'name=postgresql-airflow-search' apply, which shouldn't restart any pod.
- Apply this diff to the values-production.yaml file:
config:
  airflow:
    dags_folder: search
    instance_name: search
-   dbHost: an-db1001.eqiad.wmnet
-   dbName: airflow_search
-   dbUser: airflow_search
  auth:
    role_mappings:
      airflow-search-ops: [Op]
@@ -16,18 +13,12 @@ config:
  oidc:
    client_id: airflow_search
-external_services:
-  postgresql: [analytics]
-
ingress:
  gatewayHosts:
    default: "airflow-search"
  extraFQDNs:
    - airflow-search.wikimedia.org
-postgresql:
-  cloudnative: false
- Run helmfile -e dse-k8s-eqiad --selector 'name=production' apply to redeploy airflow, which will now connect to the PGBouncer pods of the cloudnative PG cluster.
- For airflow-analytics only, delete the backup (taken in a previous step) from an-db1001.
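For reference, a minimal sketch of the pod health check mentioned above, run from the deployment server; the kube_env invocation and read access to the airflow-search namespace are assumptions based on the standard Kubernetes tooling:
kube_env airflow-search dse-k8s-eqiad
kubectl get pods                         # airflow and postgresql pods should all be Running and Ready
kubectl get clusters.postgresql.cnpg.io  # the CloudnativePG Cluster resource should report a healthy status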