Data Platform/Systems/Ceph/Upgrading

Ceph Cluster Upgrades

We do not use Cephadm for this cluster, so the ceph orch upgrade command is not available to us.

Therefore, we must define our own best practices for upgrades.

The best reference we have is the staggered upgrade path, as described here: https://docs.ceph.com/en/quincy/cephadm/upgrade/#staggered-upgrade

This advises upgrading the components in the following order:

mgr -> mon -> crash -> osd -> mds -> rgw -> rbd-mirror -> cephfs-mirror -> iscsi -> nfs

Monitor the cluster throughout the process

Open a terminal on one of the cephosd100[1-5] servers and execute the command sudo ceph health -w.

This should show output that is similar to the following:

btullis@cephosd1001:~$ sudo ceph health -w
  cluster:
    id:     6d4278e1-ea45-4d29-86fe-85b44c150813
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum cephosd1001,cephosd1002,cephosd1003,cephosd1004,cephosd1005 (age 4M)
    mgr: cephosd1003(active, since 3w), standbys: cephosd1002, cephosd1005, cephosd1004, cephosd1001
    mds: 3/3 daemons up, 2 standby
    osd: 100 osds: 100 up (since 4M), 100 in (since 11M)
    rgw: 5 daemons active (5 hosts, 1 zones)
 
  data:
    volumes: 3/3 healthy
    pools:   17 pools, 4481 pgs
    objects: 1.02M objects, 583 GiB
    usage:   30 TiB used, 1.1 PiB / 1.1 PiB avail
    pgs:     4481 active+clean
 
  io:
    client:   1.1 MiB/s rd, 3.6 MiB/s wr, 34 op/s rd, 76 op/s wr

You can also keep an eye on the Ceph cluster health dashboard: https://grafana.wikimedia.org/goto/wmqNOroHg?orgId=1
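
The running versions of all daemons can also be summarised at any time with ceph versions; this is a quick way to confirm progress after each stage of the upgrade:

sudo ceph versions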

Update the packages on the APT repository

The first thing to do is to sync the external repository with Reprepro. This should be done on the active APT repository server, which is currently apt1002.wikimedia.org.

sudo -i reprepro --noskipold -C thirdparty/ceph-reef update bullseye-wikimedia

sudo -i reprepro --noskipold -C thirdparty/ceph-reef update bookworm-wikimedia
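
Optionally, we can confirm that the new version is now present in the repository by listing a representative package with reprepro (ceph-common is used here purely as an example):

sudo -i reprepro ls ceph-common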

We can then either wait 24 hours, or force a package update on the affected hosts:

sudo cumin A:cephosd 'apt update'

Distribute the packages to the Ceph servers

The packages do not automatically restart the daemons, so we can use Debdeploy to manage the package deployment.

We can verify which package version is available to install with apt-cache:

btullis@cephosd1001:~$ apt-cache policy ceph-common
ceph-common:
  Installed: 18.2.2-1~bpo12+1
  Candidate: 18.2.4-1~bpo12+1
  Version table:
     18.2.4-1~bpo12+1 1003
       1003 http://apt.wikimedia.org/wikimedia bookworm-wikimedia/thirdparty/ceph-reef amd64 Packages
 *** 18.2.2-1~bpo12+1 100
        100 /var/lib/dpkg/status
     16.2.15+ds-0+deb12u1 500
        500 http://mirrors.wikimedia.org/debian bookworm/main amd64 Packages
        500 http://security.debian.org/debian-security bookworm-security/main amd64 Packages

Now generate a debdeploy spec. We can use the library update type, since no daemons are restarted by the package installation.

btullis@cumin1002:~$ generate-debdeploy-spec -U library --comment T389184 ceph

<output trimmed for brevity>

Please enter the version of ceph fixed in bookworm. Leave blank if no fix is available/required for bookworm.
>18.2.4-1~bpo12+1
Please enter the version of ceph fixed in bullseye. Leave blank if no fix is available/required for bullseye.
>18.2.4-1~bpo12+1
Please enter the version of ceph fixed in buster. Leave blank if no fix is available/required for buster.
>

<output trimmed for brevity>

Spec file created as 2025-04-01-ceph.yaml

We can distribute the packages to a single host with debdeploy:

btullis@cumin1002:~$ sudo debdeploy deploy -u 2025-04-01-ceph.yaml -Q cephosd1001.eqiad.wmnet
Rolling out ceph:
Library update, several services might need to be restarted

ceph-base was updated: 18.2.2-1~bpo12+1 -> 18.2.4-1~bpo12+1
  cephosd1001.eqiad.wmnet (1 hosts)

<output trimmed for brevity>

ceph-mon was updated: 18.2.2-1~bpo12+1 -> 18.2.4-1~bpo12+1
  cephosd1001.eqiad.wmnet (1 hosts)

After this, we can verify that no daemons were restarted by checking with systemctl status:

btullis@cephosd1001:~$ systemctl status ceph*|grep -B1 Active

     Loaded: loaded (/lib/systemd/system/ceph-mds.target; enabled; preset: enabled)
     Active: active since Mon 2024-11-25 12:34:38 UTC; 4 months 5 days ago

<output trimmed for brevity>

     Loaded: loaded (/lib/systemd/system/ceph-crash.service; enabled; preset: enabled)
     Active: active (running) since Mon 2024-11-25 12:34:38 UTC; 4 months 5 days ago

We can then distribute the packages to the remaining hosts with:

sudo debdeploy deploy -u 2025-04-01-ceph.yaml -s cephosd

Restart the ceph-mgr services

On a single host, check the status of the ceph-mgr processes with:

systemctl status --with-dependencies --after ceph-mgr.target

There should be a ceph-mgr@$HOSTNAME.service unit running and shown as active. Restart the ceph-mgr.target and check that it starts successfully.
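
For example, on the chosen test host:

sudo systemctl restart ceph-mgr.target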

btullis@cephosd1001:~$ systemctl status ceph-mgr@cephosd1001.service
● ceph-mgr@cephosd1001.service - Ceph cluster manager daemon
     Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-04-01 21:02:48 UTC; 1min 8s ago
   Main PID: 1331027 (ceph-mgr)
      Tasks: 165 (limit: 308998)
     Memory: 487.2M
        CPU: 8.659s
     CGroup: /system.slice/system-ceph\x2dmgr.slice/ceph-mgr@cephosd1001.service
             └─1331027 /usr/bin/ceph-mgr -f --cluster ceph --id cephosd1001 --setuser ceph --setgroup ceph

The ceph health -w output should also show that the manager daemon has restarted and reconnected.

2025-04-01T21:00:00.000205+0000 mon.cephosd1001 [INF] overall HEALTH_OK
2025-04-01T21:02:51.837589+0000 mon.cephosd1001 [INF] Active manager daemon cephosd1003 restarted
2025-04-01T21:02:51.840883+0000 mon.cephosd1001 [INF] Activating manager daemon cephosd1003
2025-04-01T21:02:51.920379+0000 mon.cephosd1001 [INF] Manager daemon cephosd1003 is now available

We can also check the version.

btullis@cephosd1001:~$ sudo ceph tell mgr version
{
    "version": "18.2.4",
    "release": "reef",
    "release_type": "stable"
}

If there are no error messages shown, we can now restart the ceph-mgr.target units on the other servers with a Cumin command like this:

sudo cumin -b 1 -s 15 'A:cephosd and not D{cephosd1001.eqiad.wmnet}' 'systemctl restart ceph-mgr.target'

This will restart each of the four remaining ceph-mgr daemons with a 15 second gap in between them.

Check the ceph health -w output and make sure that the cluster is healthy before proceeding.

Restart the ceph-mon services

In a similar way to the mgr services, we will restart a single mon daemon. If it is stable, then we can use a cumin command to restart the others.

Unlike the mgr services, the mon services are all active, so we can check the versions of all mon daemons as follows:

btullis@cephosd1003:~$ sudo ceph tell mon.* version
mon.cephosd1001: {
    "version": "18.2.2",
    "release": "reef",
    "release_type": "stable"
}
mon.cephosd1002: {
    "version": "18.2.2",
    "release": "reef",
    "release_type": "stable"
}
mon.cephosd1003: {
    "version": "18.2.2",
    "release": "reef",
    "release_type": "stable"
}
mon.cephosd1004: {
    "version": "18.2.2",
    "release": "reef",
    "release_type": "stable"
}
mon.cephosd1005: {
    "version": "18.2.2",
    "release": "reef",
    "release_type": "stable"
}

Choose a single host again and restart its mon service, preferably using the ceph-mon.target unit.

btullis@cephosd1001:~$ sudo systemctl restart ceph-mon.target
btullis@cephosd1001:~$ echo $?
0

Check the status of the units and look for any errors.

btullis@cephosd1001:~$ systemctl status --with-dependencies --after ceph-mon.target
● ceph-mon.target - ceph target allowing to start/stop all ceph-mon@.service instances at once
     Loaded: loaded (/lib/systemd/system/ceph-mon.target; enabled; preset: enabled)
     Active: active since Tue 2025-04-01 21:20:41 UTC; 43s ago

Apr 01 21:20:41 cephosd1001 systemd[1]: Reached target ceph-mon.target - ceph target allowing to start/stop all ceph-mon@.service instances at once.

● ceph-mon@cephosd1001.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-04-01 21:20:41 UTC; 43s ago
   Main PID: 1051270 (ceph-mon)
      Tasks: 25
     Memory: 202.6M
        CPU: 2.722s
     CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@cephosd1001.service
             └─1051270 /usr/bin/ceph-mon -f --cluster ceph --id cephosd1001 --setuser ceph --setgroup ceph

Check the version numbers again:

btullis@cephosd1001:~$ sudo ceph tell mon.* version
mon.cephosd1001: {
    "version": "18.2.4",
    "release": "reef",
    "release_type": "stable"
}
mon.cephosd1002: {
    "version": "18.2.2",
    "release": "reef",
    "release_type": "stable"
}
mon.cephosd1003: {
    "version": "18.2.2",
    "release": "reef",
    "release_type": "stable"
}
mon.cephosd1004: {
    "version": "18.2.2",
    "release": "reef",
    "release_type": "stable"
}
mon.cephosd1005: {
    "version": "18.2.2",
    "release": "reef",
    "release_type": "stable"
}

Check the ceph health -w output again, to make sure that the cluster health is OK. If it is, we can proceed to restart the remaining mon daemons. This time, we will use a slightly longer gap in between them.

sudo cumin -b 1 -s 30 'A:cephosd and not D{cephosd1001.eqiad.wmnet}' 'systemctl restart ceph-mon.target'

The updated mon versions will also be shown on the Grafana dashboard: https://grafana.wikimedia.org/goto/nngxc9oNR?orgId=1

Restart the ceph-crash services

The next components to be upgraded are the ceph-crash daemons. These differ from the other services in that they are not shown in the ceph health output, so we can only rely on the systemctl status output.

Once again, restart the services on a test host first.

btullis@cephosd1001:~$ sudo systemctl restart ceph-crash.service
btullis@cephosd1001:~$ echo $?
0
btullis@cephosd1001:~$ systemctl status ceph-crash.service
● ceph-crash.service - Ceph crash dump collector
     Loaded: loaded (/lib/systemd/system/ceph-crash.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-04-01 21:37:01 UTC; 11s ago
   Main PID: 1057734 (ceph-crash)
      Tasks: 1 (limit: 308998)
     Memory: 6.6M
        CPU: 873ms
     CGroup: /system.slice/ceph-crash.service
             └─1057734 /usr/bin/python3 /usr/bin/ceph-crash

If all is well, then repeat the process with cumin.

sudo cumin -b 1 -s 15 'A:cephosd and not D{cephosd1001.eqiad.wmnet}' 'systemctl restart ceph-crash.service'
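
Because ceph-crash does not appear in the ceph health output, a cluster-wide check that the service is running everywhere can be made with Cumin:

sudo cumin A:cephosd 'systemctl is-active ceph-crash.service'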

Set the cluster into noout mode

This step is important because noout mode will prevent unnecessary data movement when the ceph-osd services are restarted.

The cluster will show a warning when noout mode is enabled.

From any cephosd host issue the following command.

sudo ceph osd set noout

The ceph health output will now show the following.

btullis@cephosd1001:~$ sudo ceph health -w
  cluster:
    id:     6d4278e1-ea45-4d29-86fe-85b44c150813
    health: HEALTH_WARN
            noout flag(s) set
 
  services:
    mon: 5 daemons, quorum cephosd1001,cephosd1002,cephosd1003,cephosd1004,cephosd1005 (age 12m)
    mgr: cephosd1003(active, since 29m), standbys: cephosd1001, cephosd1005, cephosd1002, cephosd1004
    mds: 3/3 daemons up, 2 standby
    osd: 100 osds: 100 up (since 4M), 100 in (since 11M)
         flags noout
    rgw: 5 daemons active (5 hosts, 1 zones)
 
  data:
    volumes: 3/3 healthy
    pools:   17 pools, 4481 pgs
    objects: 1.02M objects, 583 GiB
    usage:   30 TiB used, 1.1 PiB / 1.1 PiB avail
    pgs:     4481 active+clean
 
  io:
    client:   38 KiB/s rd, 1.8 MiB/s wr, 39 op/s rd, 90 op/s wr

Restart the ceph-osd services

We can also check the version number of all of the OSDs with sudo ceph tell osd.* version.

Now restart all 20 OSD daemons on a test host with the command:

sudo systemctl restart ceph-osd.target

Check that there are no failed services with:

btullis@cephosd1001:~$ systemctl --failed
  UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed.

You can check the status of all ceph-osd units with:

systemctl status --with-dependencies --after ceph-osd.target
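
It is also worth confirming that all of the OSDs on the host have rejoined the cluster. A quick summary is available from any cephosd host with ceph osd stat, which should report all 100 OSDs as up and in:

sudo ceph osd stat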

If all is well, then we can proceed to upgrade the remaining OSD servers. We will pause for 3 minutes between each of the four remaining servers and monitor the cluster health.

sudo cumin -b 1 -s 180 'A:cephosd and not D{cephosd1001.eqiad.wmnet}' 'systemctl restart ceph-osd.target'

You can check the version of the osd daemons on the Ceph Cluster dashboard: https://grafana.wikimedia.org/goto/3IK6prTHR?orgId=1
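
Alternatively, the OSD versions can be checked from the command line. This is an illustrative one-liner that counts the version strings reported by each OSD:

sudo ceph tell osd.* version | grep '"version"' | sort | uniq -c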

Check the cluster health and if all is well, proceed.

Take the cluster out of noout mode

Now that the OSD daemons have restarted, we can take the cluster out of noout mode. This re-enables data movement if a degraded situation occurs and removes the health warning.

sudo ceph osd unset noout
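
A quick check that the flag has been cleared is to run ceph health again, which should now return HEALTH_OK:

sudo ceph health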

Restart the ceph-mds services

Check the versions of all mds components.

sudo ceph tell mds.* version

We have 5 mds daemons running, of which 3 are intended to be active at any time, since we have 3 CephFS file systems. We can check which of the ceph-mds daemons are currently active by using the command sudo ceph fs dump and examining the output. For example:

btullis@cephosd1001:~$ sudo ceph fs dump | egrep '(Filesystem|up:active)'
dumped fsmap epoch 8205
Filesystem 'dpe' (1)
[mds.cephosd1004{0:17963228} state up:active seq 239868 addr [v2:10.64.134.12:6800/440240334,v1:10.64.134.12:6801/440240334] compat {c=[1],r=[1],i=[7ff]}]
Filesystem 'dumps' (2)
[mds.cephosd1002{0:17507916} state up:active seq 70 addr [v2:10.64.131.21:6800/3148591687,v1:10.64.131.21:6801/3148591687] compat {c=[1],r=[1],i=[7ff]}]
Filesystem 'home' (3)
[mds.cephosd1001{0:17555995} state up:active seq 56 addr [v2:10.64.130.13:6800/801572634,v1:10.64.130.13:6801/801572634] compat {c=[1],r=[1],i=[7ff]}]

In this case, we will start by restarting one of the standby mds daemons, namely the one running on cephosd1003.

btullis@cephosd1003:~$ sudo systemctl restart ceph-mds.target
btullis@cephosd1003:~$ echo $?
0
btullis@cephosd1003:~$ systemctl status ceph-mds@cephosd1003.service
● ceph-mds@cephosd1003.service - Ceph metadata server daemon
     Loaded: loaded (/lib/systemd/system/ceph-mds@.service; enabled; preset: enabled)
     Active: active (running) since Wed 2025-04-02 08:59:45 UTC; 8s ago
   Main PID: 1691134 (ceph-mds)
      Tasks: 16
     Memory: 16.5M
        CPU: 85ms
     CGroup: /system.slice/system-ceph\x2dmds.slice/ceph-mds@cephosd1003.service
             └─1691134 /usr/bin/ceph-mds -f --cluster ceph --id cephosd1003 --setuser ceph --setgroup ceph

Restart all of the remaining ceph-mds services using cumin, with a 30 second pause between them.

sudo cumin -b 1 -s 30 'A:cephosd and not D{cephosd1003.eqiad.wmnet}' 'systemctl restart ceph-mds.target'
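
After the restarts have completed, we can recheck the mds versions and confirm the active/standby layout, using the same commands as before:

sudo ceph tell mds.* version

sudo ceph fs dump | egrep '(Filesystem|up:active)'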

Restart the ceph-radosgw services

On any test host, check the current state of the ceph-radosgw.target unit and its dependencies.

btullis@cephosd1001:~$ systemctl status --with-dependencies --after ceph-radosgw.target
● ceph-radosgw.target - ceph target allowing to start/stop all ceph-radosgw@.service instances at once
     Loaded: loaded (/lib/systemd/system/ceph-radosgw.target; enabled; preset: enabled)
     Active: active since Mon 2024-11-25 12:34:38 UTC; 4 months 6 days ago

Notice: journal has been rotated since unit was started, output may be incomplete.

● ceph-mon.target - ceph target allowing to start/stop all ceph-mon@.service instances at once
     Loaded: loaded (/lib/systemd/system/ceph-mon.target; enabled; preset: enabled)
     Active: active since Tue 2025-04-01 21:20:41 UTC; 13h ago

Apr 01 21:20:41 cephosd1001 systemd[1]: Reached target ceph-mon.target - ceph target allowing to start/stop all ceph-mon@.service instances at once.

● ceph-radosgw@radosgw.service - Ceph rados gateway
     Loaded: loaded (/lib/systemd/system/ceph-radosgw@.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-03-11 16:10:19 UTC; 3 weeks 0 days ago
   Main PID: 2984021 (radosgw)
      Tasks: 615
     Memory: 2.5G
        CPU: 3h 59min 10.864s
     CGroup: /system.slice/system-ceph\x2dradosgw.slice/ceph-radosgw@radosgw.service
             └─2984021 /usr/bin/radosgw -f --cluster ceph --name client.radosgw --setuser ceph --setgroup ceph

Check the http_status of the requests passing through the server with:

journalctl -u ceph-radosgw@radosgw.service -f|grep http_status

These should mostly be 200, although 4xx errors are acceptable. There should not be any 5xx errors. Restart the ceph-radosgw.target unit on this host.

btullis@cephosd1001:~$ sudo systemctl restart ceph-radosgw.target
btullis@cephosd1001:~$ echo $?
0

Check the status of the ceph-radosgw units again.

systemctl status --with-dependencies --after ceph-radosgw.target

Check the logs again.

journalctl -u ceph-radosgw@radosgw.service -f|grep http_status
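
If you also want to confirm that the radosgw process is listening again after the restart (without needing to assume a particular port), a socket listing can be used:

sudo ss -tlnp | grep radosgw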

If all is well, proceed to restart the remainder of the ceph-radosgw.target units on the cluster.

sudo cumin -b 1 -s 30 'A:cephosd and not D{cephosd1001.eqiad.wmnet}' 'systemctl restart ceph-radosgw.target'

Once again, the Ceph Cluster dashboard can be used to monitor the versions of radosgw in production: https://grafana.wikimedia.org/goto/wrst_jTNg?orgId=1

Perform a rolling reboot of the cluster

At this point, all services have now been upgraded and roll-restarted, since we do not currently use the rbd-mirror, cephfs-mirror, iscsi, or nfs daemons.

It is a good idea to perform a rolling reboot of the cluster, to ensure that everything comes up as expected during the boot process.

We have a cookbook for this and it can be launched like this:

sudo cookbook sre.ceph.roll-restart-reboot-server --alias cephosd --reason "Reboot post upgrade to latest point release" --task-id T389184 reboot

Distribute the updated packages to Ceph clients