Data Platform/Systems/Ceph

The Data Platform Engineering team manages a pair of Ceph clusters for two primary purposes:

To consider the use of Ceph and its S3 compatible interface as a replacement for (or addition to) Hadoop's HDFS file system
To provide block and file storage capability to workloads running on the dse-k8s Kubernetes cluster.

A number of specific use cases have already been implemented, such as:

Postgres using the cloudnativepg operator, where primary data is stored on RBD and backups are stored on S3
Airflow:
- Task log storage is on S3
- Kerberos credential cache is on cephfs
- DAGs are distributed using cephfs
Flink checkpoints are stored on S3

Project Status

The cluster is in a production state. We manage five servers in eqiad. There is a smaller cluster of three servers ready to be configured in codfw.

Here is the original design document for the project. The Phabricator epic ticket is for the eqiad cluster deployment is: T324660

Software Components

The following sections give a brief explanation of each of the four principal software components involved in our Ceph Storage Cluster, with a reference to how we run them.

Please refer to Ceph Architecture for a more in-depth explanation of each component. Note that we do not use CephFS on our cluster, so the Metadata Server (MDS) component has been omitted.

Monitor Daemons

Ceph monitor daemons are responsible for maintaining a cluster map, which keeps an up-to-date record of the cluster topology and the location of data objects.

The Paxos algorithm is used to ensure that a quorum of servers are in agreement about the contents of the cluster map.

We run a monitor daemon on all five of our Ceph servers in eqiad, so our quorum is three active mon servers.

Manager Daemons

Ceph manager daemons run alongside monitor daemons in order to provide additional monitoring and management interfaces. Only one manager daemon is ever active, but there may be several standby servers and ceph itself manages the election of a new mgr daemon to the active role. In our eqiad cluster we have five manager daemons, since we run one alongside each monitor daemon.

Object Storage Daemons

A Ceph cluster will contain many Object Storage Daemons (referred to as OSDs). There is usually a 1:1 relationship between an OSD and a physical storage device, such as a hard drive (HDD) or a solid-state drive (SSD). These OSDs may started and stopped individually, in order to bring storage devices into and out of service.

Our clusters use the Bluestore specification of OSD, rather than the deprecated Filestore specification. This means that each OSD maintains its own RocksDB database of object metatadata, complete with write ahead log (WAL). The physical location of the WAL and the block.db database file can be specified independently of the OSD backing store, in order to optimise performance and availability. Please see https://docs.ceph.com/en/reef/rados/configuration/bluestore-config-ref/ for a more comprehensive explanation of Bluestore.

In our eqiad cluster we have 20 OSDs per host, for a total of 100 OSDs in the cluster. 60 of these are backed by hard drives and 40 by SSDs. We use a high-performance NVMe drive to host the WAL and RocksDB databases of the HDD backed OSDs.

Rados Gateway Daemons

The rados gateway daemons (referred to as radosgw) are different from the mon, mgr, and osd daemons in that they are not an internal component of the Ceph cluster. They are clients of the cluster.

They serve an HTTP interface that enables the Ceph Oject Gateway, which is the S3 and Swift compatible interface to the storage services.

In our eqiad cluster we run a rados gateway on each of our five hosts and the hostname of our S3/Swift interface is: https://rgw.eqiad.dpe.anycast.wmnet

Cluster Architecture

At present we run a co-located configuration on our cluster in eqiad comprising five hosts, each of which is in a 2U enclosure that is optimised for storage.

The server names are: cephosd100[1-5].eqiad.wmnet and they are identical to each other in terms of hardware.

Each server contains two shelves, each containing twelve hot-swappable 3.5" drive bays. At the rear of the chassis are two hot-swappable bays, which contain drives that are used for the operating system.

Each host currently has a 10 Gbps network connection to its switch. If we start to hit this throughput ceiling, we can request to increase this to 25 Gbps by changing the optics.

Storage Device Configuration

Each of the five hosts has the following primary storage devices:

The specific storage devices are as follows:


Count	Capacity	Technology	Make/Model	Total Capacity	Use Case
12	18 TB	HDD	Seagate Exos X18 nearline SAS	216 TB	Cold tier
8	3.8 TB	SSD	Kioxia RM6 mixed-use	30.4 TB	Hot tier

That makes the raw capacity of the five-node cluster:

Cold tier: 1.08 PB
Hot tier: 152 TB

In order to increase the performance of the cold tier, which is backed by hard drives, we employ an NVMe device as a cache device. This is partitioned such that each of the HDD based OSD daemons can store its Bluestore database and journal on it, increasing performance of these systems considerably.

Ceph Software Configuration

Each of our hosts runs the following Ceph services:

1 monitor daemon (ceph-mon)
1 manager daemon (ceph-mgr)
1 metadata service daemon (ceph-mds)
20 object storage daemons (ceph-osd)
1 crash monitoring daemon (ceph-crash)
1 rados gateway daemon (radosgw)

btullis@cephosd1004:~$ pstree -T|egrep 'ceph|radosgw'
        |-ceph-crash
        |-ceph-mds
        |-ceph-mgr
        |-ceph-mon
        |-20*[ceph-osd]
        |-radosgw

This cluster uses Ceph packages that are distributed by the upstream project at: https://docs.ceph.com/en/latest/install/get-packages/#apt and are integrated with reprepro.

Puppet is used to configure all of the daemons on this cluster.

Additional Services

Alongside the Ceph daemons, each of the hosts in the cluster runs the following components:

An instance of envoy which is operating as a TLS termination endpoint for the local radosgw service.
An instance of bird, which is providing support for the anycast load-balancing that we use for the radosgw service.

Cluster Configuration

Configuration Methods

The cluster configuration is partly managed with puppet and partly by-hand.

Puppet Configuration

We intend to migrate more of the configuration into puppet over time. One thing to bear in mind is that we share the ceph puppet module with the WMCS team, as they also use it for their clusters. However, we apply different puppet profiles (ceph vs cloudceph) to our clusters, in order to allow for some variance in the configuration style. Be aware of this potential impact on the WMS team when modifying puppet. In addition to this, be aware that there is a cephadm profile and a cephadm module in puppet. These are not relevant to the configuration of the DPE Ceph cluster, as they are only in use for the new apus cluster that is managed by the Data Persistence team.

Elements of the cluster configuration that are managed by puppet:

Ceph package installation.
The /etc/ceph/ceph.conf configuration file.
All daemons (mon, mgr, osd, crash, radosgw)
Preparation and activation of the hardware storage devices beneath each of the osd daemons.
All Ceph client users (Note that this is a different concept from a Ceph object storage user, which are not yet managed with puppet).

Manual Configuration

The following are currently managed by hand on the cluster.

Pool creation
Associating pools to applications
CRUSH maps and rules, which affect data placement
Maintenance flags, such as noout
radosgw user management, for S3 and Swift access (Created task T374531to address this.)

CRUSH Rules

CRUSH is an acronym for Controlled Replication Under Scalable Hashing. It is the algorithm that determines where in the storage cluster any item of data should reside, including attributes such as the number of replicas of that item and/or any parity information that would allow the item to be reconstructed in the event of any loss of a storage device.

For more detailed information on CRUSH, please refer to https://docs.ceph.com/en/reef/rados/operations/crush-map/

CRUSH rules are used to generate CRUSH maps. The idea behind a CRUSH map is that the Ceph monitor (aka mon) servers load the CRUSH maps into memory and enables clients to locate data within the cluster. When a client requests data from a Ceph cluster, the mon responds with the location of the data including on which osd(s) the data resides. The idea is to avoid a network bottleneck, since the mon does not proxy the data itself. Clients communicate directly with the osd processes when reading and writing data. This is analagous to the way in which HDFS namenodes provide a metadata service for clients to communicate with HDFS datanodes.

We currently have the following two CRUSH rules in place on the DPE Ceph cluster.

btullis@cephosd1005:~$ sudo ceph osd crush rule ls
hdd
ssd

btullis@cephosd1004:~$ sudo ceph osd crush rule dump
[
    {
        "rule_id": 1,
        "rule_name": "hdd",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -4,
                "item_name": "default~hdd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "ssd",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -6,
                "item_name": "default~ssd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

The hdd and ssd rules are for using replicated pools, selecting for the corresponding device classes. We have configured our cluster with buckets for row and rack awareness, so the CRUSH algorithm is aware of the host placement within the rows.

btullis@cephosd1005:~$ sudo ceph osd crush tree|grep -v 'osd\.'
ID   CLASS  WEIGHT      TYPE NAME                   
 -1         1149.93103  root default                
-19          689.95862      row eqiad-e             
-21          229.98621          rack e1             
 -3          229.98621              host cephosd1001
-22          229.98621          rack e2             
 -7          229.98621              host cephosd1002
-23          229.98621          rack e3             
-10          229.98621              host cephosd1003
-20          459.97241      row eqiad-f             
-24          229.98621          rack f1             
-13          229.98621              host cephosd1004
-25          229.98621          rack f2             
-16          229.98621              host cephosd1005

The osd objects (currently numbered 0-99) are then assigned to the host objects, so our data is always distributed between hosts, rows, and racks.

Pools

In a Ceph cluster, a pool is a logical partitioning of objects. Pools are associated with an application, specifically the mgr, rbd, radosgw, or cephfs applications.

We can list the pools with the ceph osd pool ls or ceph osd lspools commands.

btullis@cephosd1004:~$ sudo ceph osd lspools
2 .mgr
7 dse-k8s-csi-ssd
8 .rgw.root
9 eqiad.rgw.log
10 eqiad.rgw.control
11 eqiad.rgw.meta
12 eqiad.rgw.buckets.index
13 eqiad.rgw.buckets.data
14 eqiad.rgw.buckets.non-ec

All pool names beginning with a . are reserved for use internally by the cluster, so do not attempt to modify these pools. In our case we have the .mgr pool which is in use by the

In our case, we have created the dse-k8s-csi-ssd pool for use with the rbd application and the kubernetes Integration. It is backed by SSDs and is a replicated pool with 3 replicas.

All of those pools with rgw in their name are related to the radosgw application and underpin the S3/Swift object storage capabilities.

Kubernetes Integration

We have enabled Kubernetes block devices on the dse-k8s cluster, by means of the Ceph-CSI (Container Storage Interface) project.

Object Storage

We have enabled the Ceph Object Gateway (radosgw) in order to provide S3 and Swift compatible APIs and object storage.

User Management

Creating a ceph user

SREs can execute the following command from one of the ceph hosts (such as cephosd1001).

sudo radosgw-admin user create --uid=example --display-name="An Example"

You can check your work with the below command (also includes expected output).

 sudo radosgw-admin user info --uid=example
{
    "user_id": "example",
    "display_name": "example",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [],
    "keys": [
        {
            "user": "example",
            "access_key": "[...]",
            "secret_key": "[...]"
        }
    ],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": true,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": 4398046511104,
        "max_size_kb": 4294967296,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}

Setting quotas

sudo radosgw-admin quota set --quota-scope=user --uid=research --max-size=4T

Checking quotas

Accessing the S3 endpoint

Sample .s3cfg for s3cmd tool

Upgrading

Please see: Data Platform/Systems/Ceph/Upgrading

Troubleshooting

See Data Platform/Systems/Ceph/Troubleshooting