This page details the operations that were run to migrate airflow to Kubernetes.
Migrating an existing instance
This section addresses how to perform a piecemeal migration of the airflow instances listed in Data Platform/Systems/Airflow/Instances to Kubernetes. The result of this migration is an airflow instance (webserver, scheduler and kerberos) running entirely in Kubernetes, alongside the database itself, without any data loss.
The migration is done in 4 steps:
- Migrate the webserver to Kubernetes
- Migrate the scheduler and kerberos components to Kubernetes
- Deploy a CloudnativePG cluster in the same Kubernetes namespace as Airflow and import the data
- Cleanup
At the time of writing, we have already migrated airflow-analytics-test, and we'll assume that this documentation covers the case of the airflow-search instance.
Migrating the webserver to Kubernetes
To deploy only the webserver to Kubernetes, we need to deploy airflow in a way that ensures it talks to an external database and opts out of the scheduler and kerberos components.
Prep work
- The first thing you need to do is create Kubernetes read and deploy user credentials
- Add a namespace entry (using the same name as the airflow instance) to deployment-charts/helmfile.d/admin_ng/values/dse-k8s.yaml
namespaces:
  # ...
  airflow-search:
    deployClusterRole: deploy-airflow
    tlsExtraSANs:
      - airflow-search.wikimedia.org
- Add the airflow-search namespace under the tenantNamespaces list in deployment-charts/helmfile.d/admin_ng/values/dse-k8s-eqiad/cephfs-csi-rbd-values.yaml, as well as in deployment-charts/helmfile.d/admin_ng/values/dse-k8s-eqiad/cephfs-csi-rbd-values.yaml
- Add the airflow-search namespace to the watchedNamespaces list defined in deployment-charts/helmfile.d/admin_ng/values/dse-k8s-eqiad/cloudnative-pg.yaml
- Deploy admin_ng (see the sketch after this list)
- Create the public and internal DNS records for this instance
- Create the airflow-search-ops LDAP group
- Register the service in our IDP server (in idp.yaml). After the patch is merged and puppet has run on the idp servers, copy the OIDC secret key generated for the airflow service.
root@idp1004:# cat /etc/cas/services/airflow_search-*.json | jq -r .clientSecret
<OIDC secret key>
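For reference, a minimal sketch of the "Deploy admin_ng" step above, run from the deployment server; the checkout path is the usual location of the deployment-charts repository, so adjust it if yours differs:
# on the deployment server
cd /srv/deployment-charts/helmfile.d/admin_ng
helmfile -e dse-k8s-eqiad -i apply  # -i prompts for confirmation before applying the diff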
Defining a secret key shared between the scheduler and the webserver
To have the webserver running on Kubernetes fetch task logs from the scheduler running on an Airflow host, they need to share the same secret key. This means that we need to commit this secret key in a location that will be taken into account by Puppet, as well as in another that will be taken into account by our Helm tooling.
First, generate a random string (64 characters long is good; a sketch for doing so follows the example below). Then commit that string in the two following locations, on the puppetserver:
# /srv/git/private/hieradata/role/common/analytics_cluster/airflow/search.yaml
# warn: adapt the file path for each airflow instance
profile::airflow::instances_secrets:
search:
...
secret_key: <secret key>
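A minimal sketch for generating such a secret key; any cryptographically random 64-character string will do:
openssl rand -hex 32  # prints a 64-character hexadecimal string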
Run puppet on the airflow host, and make sure each airflow service is restarted.
Keep the secret key handy; it will be used in the next section. Copy the db_password value as well, as you will need it for the next step.
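A minimal sketch of that step, assuming an-airflow1005 is the host backing the airflow-search instance (adjust the hostname and instance name to the instance you are migrating):
brouberol@an-airflow1005:~$ sudo run-puppet-agent
brouberol@an-airflow1005:~$ systemctl status 'airflow-*@search.service'  # each unit should report a recent start time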
Defining the webserver configuration
Add the following block to /srv/git/private/hieradata/role/common/deployment_server/kubernetes.yaml, on the puppetserver.
dse-k8s:
# ...
airflow-search:
dse-k8s-eqiad:
config:
private:
airflow__webserver__secret_key: <secret key>
airflow:
postgresqlPass: <PG password from the previous section>
oidc:
client_secret: <OIDC secret key>
kerberos:
keytab: |
<base64 representation of keytab>
Then, create a new airflow-search folder in deployment-charts/helmfile.d/dse-k8s-services (feel free to copy from deployment-charts/helmfile.d/dse-k8s-services/airflow-analytics-test). Your values-production.yaml file should look like this:
config:
airflow:
dags_folder: search
instance_name: search
dbHost: an-db1001.eqiad.wmnet
dbName: airflow_search
dbUser: airflow_search
auth:
role_mappings:
airflow-search-ops: [Op]
config:
logging:
remote_logging: false
oidc:
client_id: airflow_search
external_services:
webserver:
postgresql: [analytics]
airflow: [search]
ingress:
gatewayHosts:
default: "airflow-search"
extraFQDNs:
- airflow-search.wikimedia.org
kerberos:
enabled: false
scheduler:
remote_host: an-airflow-1xxx.eqiad.wmnet # use the appropriate hostname from https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Instances
enabled: false
postgresql:
cloudnative: false
Deploy airflow with helmfile.
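A minimal sketch of this deployment, run from the deployment server inside the new service directory (the path assumes the usual deployment-charts checkout location):
cd /srv/deployment-charts/helmfile.d/dse-k8s-services/airflow-search
helmfile -e dse-k8s-eqiad apply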
Set up the ATS redirection
Follow the example of that patch to set up the appropriate redirection, making the web UI visible to the public. Once merged, it should take about 30 minutes to fully take effect.
Migrate the scheduler and kerberos components to Kubernetes
- Create the kerberos principals and the base64 representation of the instance keytab
# Change `analytics-search` to the UNIX user the airflow tasks will impersonate by default in Hadoop
brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey analytics-search/airflow-search.discovery.wmnet@WIKIMEDIA
brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey airflow/airflow-search.discovery.wmnet@WIKIMEDIA
brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey HTTP/airflow-search.discovery.wmnet@WIKIMEDIA
brouberol@krb1001:~$ sudo kadmin.local ktadd -norandkey -k analytics-search.keytab \
analytics-search/airflow-search.discovery.wmnet \
airflow/airflow-search.discovery.wmnet@WIKIMEDIA \
HTTP/airflow-search.discovery.wmnet@WIKIMEDIA
- Copy the base64 representation of the generated keytab (see the sketch after this list)
- Create the S3 user
brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=airflow-search --display-name="airflow-search" # copy the access_key and secret_key
- Add the following values to the values block in /srv/git/private/hieradata/role/common/deployment_server/kubernetes.yaml, on the puppetserver.
dse-k8s:
  # ...
  airflow-search:
    dse-k8s-eqiad:
      config:
        private:
          airflow__webserver__secret_key: <secret key>
          airflow:
            postgresqlPass: <PG password>
            aws_access_key_id: <S3 access key>      # add this!
            aws_secret_access_key: <S3 secret key>  # add this!
          oidc:
            client_secret: <OIDC secret key>
          kerberos:
            keytab: |  # add this!
              <base64 representation of keytab>
- Create the S3 bucket
brouberol@stat1008:~$ read access_key
<S3 access_key>
brouberol@stat1008:~$ read secret_key
<S3 secret key>
brouberol@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --region=dpe --host-bucket=no mb s3://logs.airflow-search.dse-k8s-eqiad
- Sync all the scheduler and DAG task logs to S3
brouberol@an-airflow1005:~$ tmux
# in tmux
brouberol@an-airflow1005:~$ sudo apt-get install s3cmd
brouberol@an-airflow1005:~$ read access_key
<S3 access_key>
brouberol@an-airflow1005:~$ read secret_key
<S3 secret_key>
brouberol@an-airflow1005:~$ cd /srv/airflow-search/logs
brouberol@an-airflow1005:/srv/airflow-search/logs$ s3cmd \
    --access_key=$access_key \
    --secret_key=$secret_key \
    --host=rgw.eqiad.dpe.anycast.wmnet \
    --region=dpe \
    --host-bucket=no \
    sync -r ./* s3://logs.airflow-search.dse-k8s-eqiad/
...
# this will take a long time. Feel free to detach the tmux session
- Once the logs are synchronized, stop all the airflow systemd services and sync the logs again, to account for the DAGs that might have run during the first sync. Make an announcement on Slack and IRC, as this will prevent any DAG from running for a while.
brouberol@an-airflow1005:~$ sudo puppet agent --disable "airflow scheduler migration to Kubernetes"
brouberol@an-airflow1005:~$ sudo systemctl stop airflow-{webserver,kerberos,scheduler}@*.service
brouberol@an-airflow1005:~$ cd /srv/airflow-search/logs
brouberol@an-airflow1005:/srv/airflow-search/logs$ s3cmd \
    --access_key=$access_key \
    --secret_key=$secret_key \
    --host=rgw.eqiad.dpe.anycast.wmnet \
    --region=dpe \
    --host-bucket=no \
    sync -r ./* s3://logs.airflow-search.dse-k8s-eqiad/
...
# this time, this should be fairly short
- Deploy the following helmfile.d/dse-k8s-services/airflow-search/values-production.yaml configuration:
config:
  airflow:
    dags_folder: search
    instance_name: search
    dbHost: an-db1001.eqiad.wmnet
    dbName: airflow_search
    dbUser: airflow_search
  auth:
    role_mappings:
      airflow-search-ops: [Op]
  config:
    core:
      executor: KubernetesExecutor
  kerberos:
    principal: analytics-search/airflow-search.discovery.wmnet
  oidc:
    client_id: airflow_search
external_services:
  webserver:
    postgresql: [analytics]
ingress:
  gatewayHosts:
    default: "airflow-search"
  extraFQDNs:
    - airflow-search.wikimedia.org
postgresql:
  cloudnative: false
- Deploy with helmfile. You should see the airflow-kerberos and airflow-scheduler pods appear. Once the deployment goes through, connect to https://airflow-search.wikimedia.org and execute DAGs to make sure they run correctly. If they do not, the fun begins: there is no real playbook here, as there is no way to tell in advance what will fail. A network policy might be missing, or a patch might need to be submitted to airflow-dags. Roll the dice!
- Once everything is working, submit a puppet patch that adds services_ensure: absent under each instance in the profile::airflow::instances hieradata associated with the airflow instance.
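As mentioned in the keytab step above, a minimal sketch for producing the base64 representation of the keytab; run it on the host where the keytab file was written, and note that the filename simply matches the ktadd example:
base64 -w0 analytics-search.keytab  # single-line base64 output, suitable for the hieradata value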
Deploy a CloudnativePG cluster in the same Kubernetes namespace as Airflow and import the data
- Create the postgresql-airflow-search S3 user. Copy the access and secret keys from the output.
brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=postgresql-airflow-search --display-name="postgresql-airflow-search"
- Create the S3 bucket in which the PG data will be stored.
brouberol@stat1008:~$ read access_key
REDACTED
brouberol@stat1008:~$ read secret_key
REDACTED
brouberol@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --region=dpe --host-bucket=no mb s3://postgresql-airflow-search.dse-k8s-eqiad
Bucket 's3://postgresql-airflow-search.dse-k8s-eqiad/' created
- Add the S3 keys to the private secret repository, in hieradata/role/common/deployment_server/kubernetes.yaml
...
postgresql-airflow-search:
  dse-k8s-eqiad:
    s3:
      accessKey: <S3 access key>
      secretKey: <S3 secret key>
    cluster:
      initdb:
        import:
          password: <PG password>
...
- Add the airflow-search namespace to the list of cloudnative-pg tenant namespaces, in helmfile.d/admin_ng/values/dse-k8s-eqiad/cloudnative-pg-values.yaml. Deploy admin_ng.
- Uncomment the sections related to postgresql/cloudnative-pg from the airflow app helmfile, and add a values-postgresql-airflow-search.yaml file containing the following data:
cluster:
  initdb:
    import:
      host: an-db1001.eqiad.wmnet
      user: airflow_search
      dbname: airflow_search
external_services:
  postgresql: [analytics]
- Before deploying, scale the airflow webserver and scheduler deployments down to 0 replicas. This will cause downtime for Airflow, so make sure to reach out to the team beforehand.
- For airflow-analytics only, perform a backup of the airflow_analytics DB
brouberol@an-db1001:~$ sudo -i -u postgres
postgres@an-db1001:~$ pg_dump airflow_analytics > "airflow_analytics_$(date -I).sql"
- For airflow-analytics only, drop any log and job data older than 365 days (see https://phabricator.wikimedia.org/T380614) by ssh-ing to an-client1002.eqiad.wmnet and running
brouberol@an-launcher1002:~$ sudo -u analytics /srv/airflow-analytics/bin/airflow-analytics db clean --tables log,job --clean-before-timestamp "$(date -d "$(date '+%Y-%m-%d') - 365 days" '+%Y-%m-%d %H:%M:%S')"
- Deploy by running helmfile -e dse-k8s-eqiad --selector 'name=postgresql-airflow-search' apply, to only deploy the cloudnative PG pods.
- Remove all PG-related secrets from the private puppet repository.
...
postgresql-airflow-search:
  dse-k8s-eqiad:
    s3:
      accessKey: REDACTED
      secretKey: REDACTED
-   cluster:
-     initdb:
-       import:
-         password: REDACTED
airflow-search:
  dse-k8s-eqiad:
    config:
      private:
        airflow__webserver__secret_key: REDACTED
        airflow:
-         postgresqlPass: REDACTED
          aws_access_key_id: REDACTED
          aws_secret_access_key: REDACTED
...
- Once all pods are healthy (see the sketch after this list), empty the values-postgresql-airflow-search.yaml file of all values, leaving it completely empty. Redeploy with helmfile -e dse-k8s-eqiad --selector 'name=postgresql-airflow-search' apply, which shouldn't restart any pod.
- Apply this diff to the values-production.yaml file:
config:
  airflow:
    dags_folder: search
    instance_name: search
-   dbHost: an-db1001.eqiad.wmnet
-   dbName: airflow_search
-   dbUser: airflow_search
  auth:
    role_mappings:
      airflow-search-ops: [Op]
@@ -16,18 +13,12 @@ config:
  oidc:
    client_id: airflow_search
-external_services:
-  postgresql: [analytics]
-
ingress:
  gatewayHosts:
    default: "airflow-search"
  extraFQDNs:
    - airflow-search.wikimedia.org
-postgresql:
-  cloudnative: false
- Run helmfile -e dse-k8s-eqiad --selector 'name=production' apply to redeploy airflow, which will now connect to the PGBouncer pods of the cloudnative PG cluster.
- For airflow-analytics only, delete the backup (taken in a previous step) from an-db1001.
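For reference, a minimal sketch of the pod health check mentioned above, run from the deployment server; the kube_env invocation and read access to the airflow-search namespace are assumptions based on the standard Kubernetes tooling:
kube_env airflow-search dse-k8s-eqiad
kubectl get pods                         # airflow and postgresql pods should all be Running and Ready
kubectl get clusters.postgresql.cnpg.io  # the CloudnativePG Cluster resource should report a healthy status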