Data Platform/Systems/Airflow/Kubernetes/Operations/K8s-Migration

This page details the operations that were run to migrate airflow to Kubernetes.

Migrating an existing instance

This section addresses how to perform a piecemeal migration of the airflow instances listed in Data Platform/Systems/Airflow/Instances to Kubernetes. The result of this migration is an airflow instance (webserver, scheduler and kerberos) running entirely in Kubernetes, alongside the database itself, without any data loss.

The migration is done in 4 steps:

  • Migrate the webserver to Kubernetes
  • Migrate the scheduler and kerberos components to Kubernetes
  • Deploy a CloudnativePG cluster in the same Kubernetes namespace as Airflow and import the data
  • Cleanup

At the time of writing, we have already migrated airflow-analytics-test, and we'll assume that this documentation covers the case of the airflow-search instance.

Migrating the webserver to Kubernetes

To deploy only the webserver to Kubernetes, we need to deploy airflow so that it talks to an external database and opts out of the scheduler and kerberos components.

Prep work
  • The first thing you need to do is create Kubernetes read and deploy user credentials
  • Add a namespace entry (using the same name as the airflow instance) to deployment-charts/helmfile.d/admin_ng/values/dse-k8s.yaml
    namespaces:
      # ...
      airflow-search:
        deployClusterRole: deploy-airflow
        tlsExtraSANs:
          - airflow-search.wikimedia.org
    
  • Add the airflow-search namespace under the tenantNamespaces list in deployment-charts/helmfile.d/admin_ng/values/dse-k8s-eqiad/cephfs-csi-rbd-values.yaml
  • Add the airflow-search namespace to the watchedNamespaces list defined in deployment-charts/helmfile.d/admin_ng/values/dse-k8s-eqiad/cloudnative-pg.yaml
  • Deploy admin_ng (see the example command after this list)
  • Create the public and internal DNS records for this instance
  • Create the airflow-search-ops LDAP group
  • Register the service in our IDP server (in idp.yaml). Once the patch is merged and puppet has run on the idp servers, copy the OIDC secret key generated for the airflow service.
    root@idp1004:# cat /etc/cas/services/airflow_search-*.json  | jq -r .clientSecret
    <OIDC secret key>
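
The "Deploy admin_ng" step above is typically run from the deployment server and looks roughly like the following (a sketch; the deployment-charts checkout path is an assumption, adapt it to your setup):

cd /srv/deployment-charts/helmfile.d/admin_ng
helmfile -e dse-k8s-eqiad -i apply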
    
Defining a secret key shared between the scheduler and the webserver

For the webserver running on Kubernetes to be able to fetch task logs from the scheduler still running on an Airflow host, both components need to share the same secret key. This means we need to commit this secret key in a location read by Puppet, as well as in another read by our Helm tooling.

First, generate a random string (64 characters long is good), then commit it in the two following locations in the private repository on the puppetserver: the instance hieradata shown below, and the deployment server hieradata described in the next section.
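One way to generate such a string (a sketch; any secure random generator works, and openssl rand -hex 32 yields 64 hexadecimal characters):

openssl rand -hex 32

The first location is the instance hieradata: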

# /srv/git/private/hieradata/role/common/analytics_cluster/airflow/search.yaml
# warn: adapt the file path for each airflow instance
profile::airflow::instances_secrets:
  search:
    ...
    secret_key: <secret key>

Run puppet on the airflow instance, and make sure each airflow service is restarted.

Keep the secret key handy, as it will be used in the next section. Copy the db_password value as well; you will need it for the next step.

Defining the webserver configuration

Add the following block in /srv/git/private/hieradata/role/common/deployment_server/kubernetes.yaml, on puppetserver.

dse-k8s:
  # ...
  airflow-search:
    dse-k8s-eqiad:
      config:
        private:
          airflow__webserver__secret_key: <secret key>
        airflow:
          postgresqlPass: <PG password from the previous section>
        oidc:
          client_secret: <OIDC secret key>
      kerberos:
        keytab: |
          <base64 representation of keytab>

Then, create a new airflow-search folder in deployment-charts/helmfile.d/dse-k8s-services (feel free to copy from deployment-charts/helmfile.d/dse-k8s-services/airflow-analytics-test). Your values-production.yaml file should look like this:

config:
  airflow:
    dags_folder: search
    instance_name: search
    dbHost: an-db1001.eqiad.wmnet
    dbName: airflow_search
    dbUser: airflow_search
    auth:
      role_mappings:
        airflow-search-ops: [Op]
    config:
      logging:
        remote_logging: false
  oidc:
    client_id: airflow_search

external_services:
  webserver:
    postgresql: [analytics]
    airflow: [search]

ingress:
  gatewayHosts:
    default: "airflow-search"
    extraFQDNs:
    - airflow-search.wikimedia.org

kerberos:
  enabled: false

scheduler:
  remote_host: an-airflow-1xxx.eqiad.wmnet  # use the appropriate hostname from https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Instances
  enabled: false

postgresql:
  cloudnative: false

Deploy airflow with helmfile.
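On the deployment server, this typically amounts to something like the following (a sketch; the checkout path and the 'production' release name are assumptions based on the commands used later on this page):

cd /srv/deployment-charts/helmfile.d/dse-k8s-services/airflow-search
helmfile -e dse-k8s-eqiad --selector 'name=production' apply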

Setting up the ATS redirection

Use that patch as an example to set up the appropriate redirection, making the web UI publicly visible. Once merged, it should take about 30 minutes to fully take effect.

Migrate the scheduler and kerberos components to Kubernetes

  • Create the kerberos principals and the base64 representation of the instance keytab
# Change `analytics` to the UNIX user the airflow tasks will impersonate by default in Hadoop
brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey analytics-search/airflow-search.discovery.wmnet@WIKIMEDIA
brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey airflow/airflow-search.discovery.wmnet@WIKIMEDIA
brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey HTTP/airflow-search.discovery.wmnet@WIKIMEDIA
brouberol@krb1001:~$ sudo kadmin.local ktadd -norandkey -k analytics-search.keytab \
    analytics-search/airflow-search.discovery.wmnet \
    airflow/airflow-search.discovery.wmnet@WIKIMEDIA \
    HTTP/airflow-search.discovery.wmnet@WIKIMEDIA
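# A sketch (hedged): one way to produce the base64 representation of the keytab
# referenced in the following steps. Adjust the filename if needed.
brouberol@krb1001:~$ base64 -w0 analytics-search.keytab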
  • Copy the base64 representation of the generated keytab
  • Create the S3 user
    brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=airflow-search --display-name="airflow-search"
    # copy the access_key and secret_key
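    # A sketch (hedged, assuming jq is available): the keys can also be retrieved again later with
    brouberol@cephosd1001:~$ sudo radosgw-admin user info --uid=airflow-search | jq -r '.keys[] | .access_key, .secret_key'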
    
  • Add the following values to the values block in /srv/git/private/hieradata/role/common/deployment_server/kubernetes.yaml, on puppetserver.
    dse-k8s:
      # ...
      airflow-search:
        dse-k8s-eqiad:
          config:
            private:
              airflow__webserver__secret_key: <secret key>
            airflow:
              postgresqlPass: <PG password>
              aws_access_key_id: <S3 access key> # add this!
              aws_secret_access_key: <S3 secret key> # add this!
            oidc:
              client_secret: <OIDC secret key>
          kerberos:
            keytab: |
              <base64 representation of keytab> #  add this!
    
  • Create the S3 bucket
    brouberol@stat1008:~$ read access_key
    <S3 access_key>
    brouberol@stat1008:~$ read secret_key
    <S3 secret key>
    brouberol@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --region=dpe --host-bucket=no mb s3://logs.airflow-search.dse-k8s-eqiad
    
  • Sync all the scheduler and DAG task logs to S3
    brouberol@an-airflow1005:~$ tmux
    # in tmux
    brouberol@an-airflow1005:~$ sudo apt-get install s3cmd
    brouberol@an-airflow1005:~$ read access_key
    <S3 access_key>
    brouberol@an-airflow1005:~$ read secret_key
    <S3 secret_key>
    brouberol@an-airflow1005:~$ cd /srv/airflow-search/logs
    brouberol@an-airflow1005:/srv/airflow-search/logs$ s3cmd \
        --access_key=$access_key \
        --secret_key=$secret_key \
        --host=rgw.eqiad.dpe.anycast.wmnet \
        --region=dpe \
        --host-bucket=no \
        sync -r ./* s3://logs.airflow-search.dse-k8s-eqiad/
    ... # this will take a long time. Feel free to detach the tmux session
    
  • Once the logs are synchronized, stop all the airflow systemd services and sync the logs again, to account for the DAGs that might have run during the first sync. Make an announcement on Slack and IRC, as this will prevent any DAG from running for a while.
    brouberol@an-airflow1005:~$ sudo puppet agent --disable "airflow scheduler migration to Kubernetes"
    brouberol@an-airflow1005:~$ sudo systemctl stop airflow-{webserver,kerberos,scheduler}@*.service
    brouberol@an-airflow1005:~$ cd /srv/airflow-search/logs
    brouberol@an-airflow1005:/srv/airflow-search/logs$ s3cmd \
        --access_key=$access_key \
        --secret_key=$secret_key \
        --host=rgw.eqiad.dpe.anycast.wmnet \
        --region=dpe \
        --host-bucket=no \
        sync -r ./* s3://logs.airflow-search.dse-k8s-eqiad/
    ... # this time, this should be fairly short
    
  • Deploy the following helmfile.d/dse-k8s-services/airflow-search/values-production.yaml configuration:
    config:
      airflow:
        dags_folder: search
        instance_name: search
        dbHost: an-db1001.eqiad.wmnet
        dbName: airflow_search
        dbUser: airflow_search
        auth:
          role_mappings:
            airflow-search-ops: [Op]
        config:
          core:
            executor: KubernetesExecutor
          kerberos:
            principal: analytics-search/airflow-search.discovery.wmnet
      oidc:
        client_id: airflow_search
    
    external_services:
      webserver:
        postgresql: [analytics]
    
    ingress:
      gatewayHosts:
        default: "airflow-search"
        extraFQDNs:
        - airflow-search.wikimedia.org
    
    postgresql:
      cloudnative: false
    
  • Deploy with helmfile. You should see the airflow-kerberos and airflow-scheduler pods appear (see the sketch after this list). Once the deployment goes through, connect to https://airflow-search.wikimedia.org and execute DAGs, to make sure they run correctly. If they do not, well, the fun begins: there's no real playbook here, as there's no way to tell in advance what will go wrong. A network policy might be missing, or a patch might need to be submitted to airflow-dags. Roll the dice!
  • Once everything is working, submit a puppet patch that adds services_ensure: absent under each instance in the profile::airflow::instances hieradata associated with the airflow instance.
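To check that the new scheduler and kerberos pods came up and are healthy, something like the following can be run from the deployment server (a sketch, assuming the usual kube_env helper; the namespace matches the instance name):

kube_env airflow-search dse-k8s-eqiad
kubectl get pods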

Deploy a CloudnativePG cluster in the same Kubernetes namespace as Airflow and import the data

As in the previous sections, this migration guide assumes that we're migrating the airflow-search instance. Replace names accordingly.
  • Create the postgresql-airflow-search S3 user. Copy the access and secret keys from the output.
    brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=postgresql-airflow-search --display-name="postgresql-airflow-search"
    
  • Create the S3 bucket in which the PG data will be stored.
    brouberol@stat1008:~$ read access_key
    REDACTED
    brouberol@stat1008:~$ read secret_key
    REDACTED
    brouberol@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --region=dpe --host-bucket=no mb s3://postgresql-airflow-search.dse-k8s-eqiad
    Bucket 's3://postgresql-airflow-search.dse-k8s-eqiad/' created
    
  • Add the S3 keys to the private secret repository, into hieradata/role/common/deployment_server/kubernetes.yaml
    ...
        postgresql-airflow-search:
          dse-k8s-eqiad:
            s3:
              accessKey: <S3 access key>
              secretKey: <S3 secret key>
            cluster:
              initdb:
                import:
                  password: <PG password>
    ...
    
  • Add the airflow-search namespace to the list of cloudnative-pg tenant namespaces, in helmfile.d/admin_ng/values/dse-k8s-eqiad/cloudnative-pg-values.yaml. Deploy admin_ng.
  • Uncomment the sections related to postgresql/cloudnative-pg from the airflow app helmfile, and add a values-postgresql-airflow-search.yaml containing the following data
    cluster:
      initdb:
        import:
          host: an-db1001.eqiad.wmnet
          user: airflow_search
          dbname: airflow_search
          
    external_services:
      postgresql: [analytics]
    
  • Before deploying, scale the airflow webserver and scheduler deployments down to 0 replicas. This will cause downtime for Airflow, so make sure to reach out to the team beforehand.
  • For airflow-analytics only, perform a backup of the airflow_analytics DB
    brouberol@an-db1001:~$ sudo -i -u postgres
    postgres@an-db1001:~$ pg_dump airflow_analytics > "airflow_analytics_$(date -I).sql"
    
  • For airflow-analytics only, drop any log and job data older than 365 days (see https://phabricator.wikimedia.org/T380614) by ssh-ing to an-launcher1002.eqiad.wmnet and running
    brouberol@an-launcher1002:~$ sudo -u analytics /srv/airflow-analytics/bin/airflow-analytics db clean --tables log,job --clean-before-timestamp $(date -d "$(date '+%Y-%m-%d') - 365 days" '+%Y-%m-%d %H:%M:%S')
    
  • Deploy by running helmfile -e dse-k8s-eqiad --selector 'name=postgresql-airflow-search' apply to only deploy the cloudnative PG pods.
  • Remove all PG-related secrets from the private puppet repository.
    ...
        postgresql-airflow-search:
          dse-k8s-eqiad:
            s3:
              accessKey: REDACTED
              secretKey: REDACTED
    -       cluster:
    -         initdb:
    -           import:
    -             password: REDACTED
    
        airflow-search:
          dse-k8s-eqiad:
            config:
              private:
                airflow__webserver__secret_key: REDACTED
              airflow:
    -           postgresqlPass: REDACTED
                aws_access_key_id: REDACTED
                aws_secret_access_key: REDACTED
    ...
    
  • Once all pods are healthy, remove every value from the values-postgresql-airflow-search.yaml file, leaving it completely empty. Redeploy with helmfile -e dse-k8s-eqiad --selector 'name=postgresql-airflow-search' apply, which shouldn't restart any pod.
  • Apply this diff to the values-production.yaml file
     config:
       airflow:
         dags_folder: search
         instance_name: search
    -    dbHost: an-db1001.eqiad.wmnet
    -    dbName: airflow_search
    -    dbUser: airflow_search
         auth:
           role_mappings:
         airflow-search-ops: [Op]
    @@ -16,18 +13,12 @@ config:
       oidc:
         client_id: airflow_search
    
    -external_services:
    -  postgresql: [analytics]
    -
     ingress:
       gatewayHosts:
         default: "airflow-search"
         extraFQDNs:
         - airflow-search.wikimedia.org
    
    -postgresql:
    -  cloudnative: false
    
  • Run helmfile -e dse-k8s-eqiad --selector 'name=production' apply to redeploy airflow, which will now connect to the PGBouncer pods of the cloudnative PG cluster.
  • For airflow-analytics only, delete the backup (taken in a previous step) from an-db1001.
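A sketch of that last cleanup step, reusing the backup file naming pattern from the earlier backup step:

brouberol@an-db1001:~$ sudo -i -u postgres
postgres@an-db1001:~$ rm airflow_analytics_*.sql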