User:Jhedden/notes/Ceph-Old
This page is outdated!
The data here has been updated and migrated to Portal:Cloud VPS/Admin/Ceph
CloudVPS use cases
Phase 1
Block Storage
CloudVPS hypervisors using libvirtd and QEMU can attach to Ceph block devices using librbd (user space implementation of the Ceph block device).
Utilizing Ceph block devices will allow for fast virtual machine live migrations, persistent volume attachments through Cinder and copy-on-write snapshot capabilities.
Important note: Ceph doesn't support QCOW2 for hosting a virtual machine disk. Thus, if you want to boot virtual machines in Ceph (ephemeral backend or boot from volume), the Glance image format must be RAW.
Future
NFS replacement
A potential solution could be CephFS mounted directly on the clients, or CephFS with Ganesha providing NFS services to clients
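For illustration only, a direct CephFS kernel mount on a client could look like the following sketch; the monitor hostname, client name and secret file are placeholders, not an existing deployment:
# Minimal sketch of a direct CephFS kernel mount; all values are illustrative
sudo mount -t ceph cephmon01.example.wmflabs:6789:/ /mnt/cephfs \
    -o name=cephfs-client,secretfile=/etc/ceph/cephfs-client.secret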
Bare-metal host configuration
BIOS
Base settings
- Ensure "Boot mode" is set to "BIOS" and not "UEFI". (This is required for the netboot process)
PXE boot settings
All Ceph hosts are equipped with a 10Gb Broadcom BCM57412 NIC and are not using the embedded onboard NIC.
1. During the system boot, when prompted to configure the second Broadcom BCM57412 NIC device press "ctrl + s"
2. On the main menu select "MBA Configuration" and toggle the "Boot Protocol" setting to "Preboot Execution Environment (PXE)"
3. Press escape, then select "Exit and Save Configurations"
4. After the system reboots, press "F2" to enter "System Setup"
5. Navigate to "System BIOS > Boot Settings > BIOS Boot Settings"
6. Select "Boot Sequence" and change the boot order to: "Hard drive C:", "NIC in Slot 2 Port 1..", "Embedded NIC 1 Port 1..."
7. Exit System Setup, saving your changes and rebooting the system
Alternatively, steps 4 through 7 can be replaced with racadm, but you will still need to enable the PXE boot protocol in the option ROM.
/admin1-> racadm set BIOS.BiosBootSettings.bootseq HardDisk.List.1-1,NIC.Slot.2-1-1,NIC.Embedded.1-1-1
/admin1-> racadm jobqueue create BIOS.Setup.1-1
/admin1-> racadm serveraction hardreset
Local storage
OSD Drives
Each OSD server has 8 x Intel D3-S4610 series SSDs dedicated for Ceph storage.
These drives are equipped with capacitors that provide "Enhanced Power Loss Data Protection", which will flush the drive's write cache to disk in the event of a power loss. This feature also allows fsync (flush cache) calls to be safely ignored, improving transactional write IOPS.
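As a quick sanity check, the state of a drive's volatile write cache can be queried with hdparm; the device name below is only an example:
sudo hdparm -W /dev/sdc    # "write-caching = 1 (on)" indicates the volatile write cache is enabled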
Drive information
PD Type: SATA
Raw Size: 1.746 TB [0xdf8fe2b0 Sectors]
Non Coerced Size: 1.745 TB [0xdf7fe2b0 Sectors]
Coerced Size: 1.745 TB [0xdf7c0000 Sectors]
Sector Size: 512
Logical Sector Size: 512
Physical Sector Size: 4096
Firmware state: JBOD
Device Firmware Level: DL63
Inquiry Data: PHYG915600S41P9DGN SSDSC2KG019T8R XCV1DL63
RAID configuration
Each OSD server is equipped with an LSI Logic / Symbios Logic MegaRAID SAS-3 3108 storage adapter and two types of solid state drives:
- 2 x Intel SSD D3-S4610 Series SSDSC2KG240G8R (Operating System)
- 8 x Intel SSD D3-S4610 Series SSDSC2KG019T8R (Ceph Object Storage)
Ceph manages data redundancy in the application layer; because of this, each OSD drive is configured as a JBOD in NON-RAID mode.
Using hardware RAID on the operating system drives was tested, but due to less than ideal device mapping with mixed RAID and NON-RAID devices, software RAID was selected.
Operating System
All Ceph hosts will be using Debian 10 (Buster). There are some packaging concerns as the upstream deb packages are a few versions behind.
Test environment installation notes
Rook
Hieradata
profile::ceph::docker::settings:
  log-driver: json-file
profile::ceph::docker::version: '5:19.03.0~3-0~debian-stretch'
profile::ceph::etcd::bootstrap: true
profile::ceph::k8s::apiserver: 'jeh-cephmon01.testlabs.eqiad.wmflabs'
profile::ceph::k8s::node_token: '<MASKED>.<MASKED>'
profile::ceph::k8s::pause_image: 'docker-registry.tools.wmflabs.org/pause:3.1'
profile::ceph::k8s::pod_subnet: '192.168.0.0/16'
profile::ceph::k8s::version: '1.15.5'
profile::ceph::k8s::pkg_release: '00'
profile::ceph::mon_hosts:
  - jeh-cephmon01.testlabs.eqiad.wmflabs
  - jeh-cephmon02.testlabs.eqiad.wmflabs
  - jeh-cephmon03.testlabs.eqiad.wmflabs
puppetmaster: jeh-puppetmaster.testlabs.eqiad.wmflabs
Puppet Roles
role::wmcs::ceph::mon
role::wmcs::ceph::osd
ETCD
After applying the roles with Puppet, check etcd health:
etcdctl --endpoints https://$(hostname -f):2379 \
  --key-file /var/lib/puppet/ssl/private_keys/$(hostname -f).pem \
  --cert-file /var/lib/puppet/ssl/certs/$(hostname -f).pem cluster-health
member 559a8dc863a539a is healthy: got healthy result from https://jeh-cephmon02.testlabs.eqiad.wmflabs:2379
member 60c132b2786c6c2 is healthy: got healthy result from https://jeh-cephmon01.testlabs.eqiad.wmflabs:2379
member d01fd114808eb37b is healthy: got healthy result from https://jeh-cephmon03.testlabs.eqiad.wmflabs:2379
cluster is healthy
Remove the `profile::ceph::etcd::bootstrap: true` Hiera key and re-run Puppet.
Reboot the hosts to clear up iptables rules.
Kubeadm
Initialize kubernetes and configure kubectl
kubeadm init --config /etc/kubernetes/kubeadm-init.yaml --upload-certs
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Apply the base pod security profiles and calico manifests
kubectl apply -f /etc/kubernetes/psp/base-pod-security-policies.yaml
kubectl apply -f /etc/kubernetes/calico.yaml
Add the other cephmon0[2-3] nodes to the control plane
kubeadm join jeh-cephmon01.testlabs.eqiad.wmflabs:6443 --token <TOKEN> \
  --discovery-token-ca-cert-hash <CERT HASH> \
  --control-plane --certificate-key <CERT KEY>
Add all the cephosd0[1-3] nodes as workers
cephosd01:~# kubeadm join jeh-cephmon01.testlabs.eqiad.wmflabs:6443 --token <TOKEN> \
  --discovery-token-ca-cert-hash <CERT HASH>
Untaint the cephmon0[1-3] nodes to allow pod workloads
kubectl taint nodes jeh-cephmon01 node-role.kubernetes.io/master-
kubectl taint nodes jeh-cephmon02 node-role.kubernetes.io/master-
kubectl taint nodes jeh-cephmon03 node-role.kubernetes.io/master-
Rook operator
Label nodes with rook roles
kubectl label nodes jeh-cephmon01 role=storage-mon
kubectl label nodes jeh-cephosd01 role=storage-osd
Apply the rook manifests
kubectl create -f /etc/rook/common.yaml
kubectl create -f /etc/rook/operator.yaml
kubectl create -f /etc/rook/cluster.yaml
kubectl create -f /etc/rook/toolbox.yaml
View the operator logs
kubectl -n rook-ceph logs -l "app=rook-ceph-operator"
Connect to the toolbox container and run ceph commands
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
Ceph-Deploy
Ceph-deploy requires password-less SSH authentication between the storage cluster nodes. Based on this requirement, the ceph-deploy utility was not evaluated. https://docs.ceph.com/docs/master/start/quick-start-preflight/#ceph-deploy-setup
Debian Packages
Prebuilt debs
Debian does not have Buster packages for Ceph available. More details at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907123
The full list of pre-built official upstream packages is available at https://download.ceph.com
Unofficial packages are available at https://croit.io/2019/07/07/2019-07-07-debian-mirror and https://github.com/croit/ceph. These packages may include patches and/or enhancements made by Croit.
Building debs
Ceph includes support for building Debian Buster packages. If we plan to go this route, these packages will be uploaded to our apt repository. Packages can be built using the following process:
$ git clone https://github.com/ceph/ceph.git /srv/ceph
$ cd /srv/ceph
$ git checkout tags/v14.2.4 -b nautilus_latest
$ ./install-deps.sh
$ ./make-deb.sh
By default the Ceph packages will be located in `/tmp/release/Debian/WORKDIR`. NOTE: the build process requires ~150GB of free space in /tmp.
List of packages used for testing
Dependencies
apt install \
  binutils \
  cryptsetup \
  cryptsetup-bin \
  libgoogle-perftools4 \
  libibverbs1 \
  libleveldb1d \
  liblttng-ust0 \
  liboath0 \
  librabbitmq4 \
  librdmacm1 \
  python-bcrypt \
  python-cherrypy3 \
  python-pecan \
  python-werkzeug
Ceph libraries
dpkg -i \
  libcephfs2_14.2.4-1_amd64.deb \
  librbd1_14.2.4-1_amd64.deb \
  librgw2_14.2.4-1_amd64.deb \
  librados2_14.2.4-1_amd64.deb \
  libradosstriper1_14.2.4-1_amd64.deb
Ceph common and python modules
dpkg -i \
  ceph-common_14.2.4-1_amd64.deb \
  python-ceph-argparse_14.2.4-1_all.deb \
  python-cephfs_14.2.4-1_amd64.deb \
  python-rbd_14.2.4-1_amd64.deb \
  python-rgw_14.2.4-1_amd64.deb \
  python-rados_14.2.4-1_amd64.deb
Ceph Base and Services
dpkg -i \
  ceph-base_14.2.4-1_amd64.deb \
  ceph-mgr_14.2.4-1_amd64.deb \
  ceph-mon_14.2.4-1_amd64.deb \
  ceph-osd_14.2.4-1_amd64.deb
Puppet installation
Puppet modules have been built based on the manual installation procedures defined at https://docs.ceph.com/docs/master/install/
Puppet
Roles
- wmcs::ceph::mon Deploys the Ceph monitor and manager daemon to support CloudVPS hypervisors
- wmcs::ceph::osd Deploys the Ceph object storage daemon to support CloudVPS hypervisors
- role::wmcs::openstack::eqiad1::virt_ceph Deploys nova-compute configured with RBD based virtual machines
Profiles
- profile::ceph::client::rbd Install and configure a Ceph RBD client
- profile::ceph::osd Install and configure Ceph object storage daemon
- profile::ceph::mon Install and configure Ceph monitor and manager daemon
Modules
- ceph Install and configure the base Ceph installation used by all services and clients
- ceph::admin Configures the Ceph administrator keyring
- ceph::mgr Install and configure the Ceph manager daemon
- ceph::mon Install and configure the Ceph monitor daemon
- ceph::keyring Defined resource that manages access control and keyrings
Hieradata
Initial Ceph configuration
# Ceph configuration for testing RADOS block devices in CloudVPS
# using filestore backend on virtual machines
[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
fsid = 078d40c4-2f87-4f9b-9e61-1da3053bc925
mon initial members = cloudcephmon1001,cloudcephmon1002,cloudcephmon1003

[mon.cloudcephmon1001]
host = cloudcephmon1001
mon addr = 208.80.154.148

[mon.cloudcephmon1002]
host = cloudcephmon1002
mon addr = 208.80.154.149

[mon.cloudcephmon1003]
host = cloudcephmon1003
mon addr = 208.80.154.150

[client]
rbd cache = true
rbd cache writethrough until flush = true
admin socket = /var/run/ceph/guests/$cluster-$type.$id.$pid.$cctid.asok
log file = /var/log/ceph/qemu/qemu-guest-$pid.log
rbd concurrent management ops = 20
Post puppet procedures
Adding OSDs
Locate available disks with lsblk
cloudcephosd1001:~# lsblk
NAME                            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                               8:0    0 223.6G  0 disk
├─sda1                            8:1    0  46.6G  0 part
│ └─md0                           9:0    0  46.5G  0 raid1 /
├─sda2                            8:2    0   954M  0 part
│ └─md1                           9:1    0   953M  0 raid1 [SWAP]
└─sda3                            8:3    0 176.1G  0 part
  └─md2                           9:2    0   176G  0 raid1
    └─cloudcephosd1001--vg-data 253:2    0 140.8G  0 lvm   /srv
sdb                               8:16   0 223.6G  0 disk
├─sdb1                            8:17   0  46.6G  0 part
│ └─md0                           9:0    0  46.5G  0 raid1 /
├─sdb2                            8:18   0   954M  0 part
│ └─md1                           9:1    0   953M  0 raid1 [SWAP]
└─sdb3                            8:19   0 176.1G  0 part
  └─md2                           9:2    0   176G  0 raid1
    └─cloudcephosd1001--vg-data 253:2    0 140.8G  0 lvm   /srv
sdc                               8:80   0   1.8T  0 disk
sdd                               8:96   0   1.8T  0 disk
sde                               8:80   0   1.8T  0 disk
sdf                               8:80   0   1.8T  0 disk
sdg                               8:96   0   1.8T  0 disk
sdh                               8:112  0   1.8T  0 disk
sdi                               8:128  0   1.8T  0 disk
sdj                               8:144  0   1.8T  0 disk
To prepare a disk for Ceph first zap the disk
cloudcephosd1001:~# ceph-volume lvm zap /dev/sdc
--> Zapping: /dev/sdc
--> --destroy was not specified, but zapping a whole device will remove the partition table
Running command: /bin/dd if=/dev/zero of=/dev/sdc bs=1M count=10
 stderr: 10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.00357845 s, 2.9 GB/s
--> Zapping successful for: <Raw Device: /dev/sdc>
Then prepare, activate and start the new OSD
cloudcephosd1001:~# ceph-volume lvm create --bluestore --data /dev/sde
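To confirm the new OSD came up and joined the cluster, something like the following can be used (output will vary per host):
sudo ceph osd tree    # the new OSD should be listed as "up"
sudo ceph osd df      # shows per-OSD utilization and PG counts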
Creating Pools
To create a new storage pool you will first need to determine the number of placement groups that will be assigned to the new pool. You can use the calculator at https://ceph.io/pgcalc/ to help identify the starting point (note: you can easily increase, but not decrease, this value):
sudo ceph osd pool create eqiad1-compute 512
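As a rough sketch of the heuristic behind the calculator (the real tool applies additional rounding rules and per-pool data percentages), the common starting point is about 100 PGs per OSD divided by the replica count, rounded to a power of two:
# hypothetical values for illustration only
osds=24; replicas=3; target_per_osd=100
echo $(( osds * target_per_osd / replicas ))   # 800 -> round to a nearby power of two (e.g. 512 or 1024)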
Enable the RBD application for the new pool
sudo ceph osd pool application enable eqiad1-compute rbd
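The pool settings and enabled applications can then be verified with:
sudo ceph osd pool ls detail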
Rate limiting
Native
Native RBD rate limiting is supported in the Ceph Nautilus release. Due to upstream availability and multiple Debian releases, we will likely have a mixture of older Ceph client versions during phase 1.
Available rate limiting options and their defaults in the Nautilus release:
$ rbd config pool ls <pool> | grep qos
rbd_qos_bps_burst            0   config
rbd_qos_bps_limit            0   config
rbd_qos_iops_burst           0   config
rbd_qos_iops_limit           0   config
rbd_qos_read_bps_burst       0   config
rbd_qos_read_bps_limit       0   config
rbd_qos_read_iops_burst      0   config
rbd_qos_read_iops_limit      0   config
rbd_qos_schedule_tick_min    50  config
rbd_qos_write_bps_burst      0   config
rbd_qos_write_bps_limit      0   config
rbd_qos_write_iops_burst     0   config
rbd_qos_write_iops_limit     0   config
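A limit could then be applied per pool or per image with `rbd config pool set` / `rbd config image set`; the option names come from the list above, but the values below are purely illustrative:
$ rbd config pool set <pool> rbd_qos_iops_limit 500
$ rbd config image set <pool>/<image> rbd_qos_write_bps_limit $((100<<20))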
OpenStack
IO rate limiting can also be managed using a flavor's metadata. This will trigger libvirt to apply `iotune` limits on the ephemeral disk.
- Available disk tuning options
- disk_read_bytes_sec
- disk_read_iops_sec
- disk_write_bytes_sec
- disk_write_iops_sec
- disk_total_bytes_sec
- disk_total_iops_sec
NOTE: Updating a flavor's metadata does not have any effect on existing virtual machines.
Example commands to create or modify a flavor's metadata with rate limiting options roughly equal to a 7200 RPM SATA disk:
openstack flavor create \
  --ram 2048 \
  --disk 20 \
  --vcpus 1 \
  --private \
  --project testlabs \
  --id 857921a5-f0af-4069-8ad1-8f5ea86c8ba2 \
  --property quota:disk_total_iops_sec=100 m1.small-ceph
openstack flavor set --property quota:disk_total_bytes_sec=$((100<<20)) 857921a5-f0af-4069-8ad1-8f5ea86c8ba2
Example rate limit configuration as seen by libvirt (`virsh dumpxml <instance name>`):
<target dev='vda' bus='virtio'/>
<iotune>
  <total_bytes_sec>104857600</total_bytes_sec>
  <total_iops_sec>100</total_iops_sec>
</iotune>
Monitoring
Dashboards
The Grafana dashboards provided by the Ceph community have been installed and updated for our environment.
Upstream source: https://github.com/ceph/ceph/tree/master/monitoring/grafana/dashboards
- Ceph Cluster overview
- Ceph Host overview
- Ceph OSD overview
- Ceph OSD host details
- Ceph OSD device details
- Ceph Pool details
- Ceph Pool overview
- Ceph Estimates
Icinga alerts
Ceph Cluster Health
- Description
  - Ceph storage cluster health check
- Status Codes
  - 0 - healthy, all services are healthy
  - 1 - warn, cluster is running in a degraded state, data is still accessible
  - 2 - critical, cluster is failed, some or all data is inaccessible
- Next steps
  - On one of the ceph monitor hosts (e.g. cloudcephmon1001.wikimedia.org) check the output of the command `sudo ceph --status`. Example output from a healthy cluster:
cloudcephmon1001:~$ sudo ceph --status
  cluster:
    id:     5917e6d9-06a0-4928-827a-f489384975b1
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 (age 3w)
    mgr: cloudcephmon1002(active, since 10d), standbys: cloudcephmon1003, cloudcephmon1001
    osd: 24 osds: 24 up (since 3w), 24 in (since 3w)

  data:
    pools:   1 pools, 256 pgs
    objects: 3 objects, 19 B
    usage:   25 GiB used, 42 TiB / 42 TiB avail
    pgs:     256 active+clean
Ceph Monitor Quorum
- Description
  - Verify there are enough Ceph monitor daemons running for proper quorum
- Status Codes
  - 0 - healthy, 3 or more Ceph monitors are running
  - 2 - critical, less than 3 Ceph monitors are running
- Next steps
  - On one of the ceph monitor hosts (e.g. cloudcephmon1001.wikimedia.org) check the output of the command `sudo ceph mon stat`. Example output from a healthy cluster:
cloudcephmon1001:~$ sudo ceph mon stat
e1: 3 mons at {cloudcephmon1001=[v2:208.80.154.148:3300/0,v1:208.80.154.148:6789/0],cloudcephmon1002=[v2:208.80.154.149:3300/0,v1:208.80.154.149:6789/0],cloudcephmon1003=[v2:208.80.154.150:3300/0,v1:208.80.154.150:6789/0]}, election epoch 24, leader 0 cloudcephmon1001, quorum 0,1,2 cloudcephmon1001,cloudcephmon1002,cloudcephmon1003
Performance Testing
Network
Baseline (default tuning options)
Iperf options used to simulate Ceph storage IO.
-N      disable Nagle's Algorithm
-l 4M   set read/write buffer size to 4 megabytes
-P <n>  number of parallel client threads to run (one per OSD)
Server:
iperf -s -N -l 4M
Client:
iperf -c <server> -N -l 4M -P 8
cloudcephosd <-> cloudcephosd
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0-10.0 sec  2.74 GBytes  2.35 Gbits/sec
[ 10] 0.0-10.0 sec  2.74 GBytes  2.35 Gbits/sec
[  9] 0.0-10.0 sec   664 MBytes   557 Mbits/sec
[  6] 0.0-10.0 sec   720 MBytes   603 Mbits/sec
[  5] 0.0-10.0 sec  1.38 GBytes  1.18 Gbits/sec
[ 13] 0.0-10.0 sec  1.38 GBytes  1.18 Gbits/sec
[  7] 0.0-10.0 sec   720 MBytes   602 Mbits/sec
[  8] 0.0-10.0 sec   720 MBytes   603 Mbits/sec
[SUM] 0.0-10.0 sec  11.0 GBytes  9.42 Gbits/sec
cloudvirt1022 -> cloudcephosd
cloudvirt1022 <-> cloudcephosd: 8.55 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  7] 0.0-10.0 sec  1.11 GBytes   949 Mbits/sec
[  6] 0.0-10.0 sec  1.25 GBytes  1.07 Gbits/sec
[  4] 0.0-10.0 sec  1.39 GBytes  1.19 Gbits/sec
[  9] 0.0-10.0 sec  1.24 GBytes  1.06 Gbits/sec
[ 10] 0.0-10.0 sec  1.07 GBytes   920 Mbits/sec
[  5] 0.0-10.0 sec  1.36 GBytes  1.16 Gbits/sec
[  3] 0.0-10.0 sec  1.41 GBytes  1.21 Gbits/sec
[  8] 0.0-10.0 sec  1.17 GBytes  1.00 Gbits/sec
[SUM] 0.0-10.0 sec  10.0 GBytes  8.55 Gbits/sec
Ceph RBD
Test cases
FIO random read/write
$ fio --name fio-randrw \
      --bs=4k \
      --direct=1 \
      --filename=/srv/fio.randrw \
      --fsync=256 \
      --gtod_reduce=1 \
      --iodepth=64 \
      --ioengine=libaio \
      --randrepeat=1 \
      --readwrite=randrw \
      --rwmixread=50 \
      --size=5G \
      --group_reporting
FIO sequential read/write
$ fio --name=fio-seqrw \
      --bs=4k \
      --direct=1 \
      --filename=/srv/fio.seqrw \
      --fsync=256 \
      --gtod_reduce=1 \
      --iodepth=64 \
      --ioengine=libaio \
      --rw=rw \
      --size=5G \
      --group_reporting
Baseline (default tuning options)
single virtual machine
$ dd if=/dev/zero of=/srv/test.dd bs=4k count=125000 conv=sync
512000000 bytes (512 MB, 488 MiB) copied, 0.875202 s, 585 MB/s
FIO sequential read/write
fio-seqrw: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
fio-seqrw: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=12.1MiB/s,w=11.9MiB/s][r=3092,w=3045 IOPS][eta 00m:00s]
fio-seqrw: (groupid=0, jobs=1): err= 0: pid=31970: Fri Jan 10 15:24:29 2020
  read: IOPS=3849, BW=15.0MiB/s (15.8MB/s)(2561MiB/170310msec)
   bw (  KiB/s): min= 7048, max=41668, per=100.00%, avg=15403.11, stdev=7014.38, samples=340
   iops        : min= 1762, max=10417, avg=3850.78, stdev=1753.59, samples=340
  write: IOPS=3846, BW=15.0MiB/s (15.8MB/s)(2559MiB/170310msec); 0 zone resets
   bw (  KiB/s): min= 6464, max=41365, per=100.00%, avg=15389.03, stdev=7006.27, samples=340
   iops        : min= 1616, max=10341, avg=3847.25, stdev=1751.56, samples=340
  cpu          : usr=3.43%, sys=11.35%, ctx=623109, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.4%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=655676,655044,0,5021 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=15.0MiB/s (15.8MB/s), 15.0MiB/s-15.0MiB/s (15.8MB/s-15.8MB/s), io=2561MiB (2686MB), run=170310-170310msec
  WRITE: bw=15.0MiB/s (15.8MB/s), 15.0MiB/s-15.0MiB/s (15.8MB/s-15.8MB/s), io=2559MiB (2683MB), run=170310-170310msec

Disk stats (read/write):
  vda: ios=656106/663558, merge=28/3888, ticks=3895800/3550224, in_queue=7399608, util=74.52%
Rate limiting enabled
single virtual machine
$ dd if=/dev/zero of=/srv/1test.dd bs=4k count=125000 conv=sync
512000000 bytes (512 MB, 488 MiB) copied, 4.57852 s, 112 MB/s
FIO sequential read/write
fio-seqrw: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [M(1)][100.0%][r=228KiB/s,w=172KiB/s][r=57,w=43 IOPS][eta 00m:00s]
fio-seqrw: (groupid=0, jobs=1): err= 0: pid=30958: Fri Jan 10 19:10:12 2020
  read: IOPS=49, BW=198KiB/s (203kB/s)(2561MiB/13237587msec)
   bw (  KiB/s): min=    7, max=  584, per=100.00%, avg=201.54, stdev=48.09, samples=26014
   iops        : min=    1, max=  146, avg=50.33, stdev=12.02, samples=26014
  write: IOPS=49, BW=198KiB/s (203kB/s)(2559MiB/13237587msec); 0 zone resets
   bw (  KiB/s): min=    7, max=  696, per=100.00%, avg=201.30, stdev=56.83, samples=26023
   iops        : min=    1, max=  174, avg=50.27, stdev=14.21, samples=26023
  cpu          : usr=0.16%, sys=0.62%, ctx=1208453, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.4%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=655676,655044,0,5021 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=198KiB/s (203kB/s), 198KiB/s-198KiB/s (203kB/s-203kB/s), io=2561MiB (2686MB), run=13237587-13237587msec
  WRITE: bw=198KiB/s (203kB/s), 198KiB/s-198KiB/s (203kB/s-203kB/s), io=2559MiB (2683MB), run=13237587-13237587msec

Disk stats (read/write):
  vda: ios=659758/669762, merge=0/9391, ticks=517938687/296952049, in_queue=800011092, util=98.11%
CloudVPS Configuration Changes
Base OS
Enabling jumbo frames (9k MTU) will improve the network throughput and overall network performance, as well as reduce the OSD CPU utilization.
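A sketch of enabling and verifying a 9000-byte MTU manually; the interface name and peer are placeholders, and in practice this would be handled through the standard network configuration:
sudo ip link set dev eno1 mtu 9000
ping -M do -s 8972 <peer>    # 8972 = 9000 minus 28 bytes of IP/ICMP headers; must succeed without fragmentation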
Glance
TODO add more detail, need to store OS images in glance
Nova compute CPU mode
A virtual machine can only be live migrated to a hypervisor with a matching CPU model. CloudVPS currently has multiple CPU models and is using the default "host-model" nova configuration.
To enable live migration between any production hypervisor, the cpu_mode parameter should match the lowest common hypervisor CPU model (see the example configuration after the table below).
Hypervisor range | CPU model | Launch date |
---|---|---|
cloudvirt[1023-1030].eqiad.wmnet | Gold 6140 Skylake | 2017 |
cloudvirt[1016-1022].eqiad.wmnet | E5-2697 v4 Broadwell | 2016 |
cloudvirt[1012-1014].eqiad.wmnet | E5-2697 v3 Haswell | 2014 |
cloudvirt[1001-1009].eqiad.wmnet | E5-2697 v2 Ivy Bridge | 2013 |
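A sketch of the corresponding nova-compute configuration, assuming Ivy Bridge is the lowest common model; the exact option name depends on the Nova release (`cpu_model` in older releases, `cpu_models` in newer ones), and the libvirt model name shown is illustrative:
[libvirt]
cpu_mode = custom
cpu_model = IvyBridge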
Virtual Machine Images
Important: Using QCOW2 for hosting a virtual machine disk is NOT recommended. If you want to boot virtual machines in Ceph (ephemeral backend or boot from volume), please use the raw image format within Glance.
Once all CloudVPS virtual machines have been migrated to Ceph we can convert the existing virtual machine images in Glance from QCOW2 to raw. This will avoid having nova-compute convert the image each time a new virtual machine is created.
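A possible conversion workflow for an existing Glance image (image names and IDs are placeholders, not a tested procedure):
openstack image save --file image.qcow2 <image-id>
qemu-img convert -f qcow2 -O raw image.qcow2 image.raw
openstack image create --disk-format raw --container-format bare --file image.raw <new-image-name>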
VirtIO SCSI devices
Currently CloudVPS virtual machines are configured with the virtio-blk driver. This driver does not support discard/trim operations to free up deleted blocks.
Discard support can be enabled by using the virtio-scsi driver, but it's important to note that the device labels will change from /dev/vda to /dev/sda.
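One way to opt an image into virtio-scsi is through Glance image properties, shown here as a hedged example (existing instances are unaffected until they are rebuilt from the updated image):
openstack image set --property hw_scsi_model=virtio-scsi --property hw_disk_bus=scsi <image-id>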
Migrating VMs from local storage to Ceph
NOTE: This process is working but requires more testing and verification
Switch puppet roles to the Ceph enabled wmcs::openstack::eqiad1::virt_ceph role. In operations/puppet/manifests/site.pp:
node 'cloudvirt1022.eqiad.wmnet' {
    role(wmcs::openstack::eqiad1::virt_ceph)
}
Run the puppet agent on the hypervisor
hypervisor $ sudo puppet agent -tv
Shutdown the VM
cloudcontrol $ openstack server stop <UUID>
Convert the local QCOW2 image to raw and upload to Ceph
hypervisor $ qemu-img convert -f qcow2 -O raw /var/lib/nova/instances/<UUID>/disk rbd:compute/<UUID>_disk:id=eqiad1-compute
Undefine the virtual machine. This command removes the existing libvirt definition from the hypervisor; once nova attempts to start the VM, it will be redefined with the RBD configuration. (This step can be skipped, but you may notice some errors in nova-compute.log until the VM has been restarted.)
hypervisor $ virsh undefine <OS-EXT-SRV-ATTR:instance_name>
Cleanup local storage files
hypervisor $ rm /var/lib/nova/instances/<UUID>/disk
hypervisor $ rm /var/lib/nova/instances/<UUID>/disk.info
Power on the VM
cloudcontrol $ openstack server start <UUID>
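To confirm the restarted VM is now backed by RBD, the libvirt disk definition can be checked on the hypervisor, for example:
hypervisor $ virsh domblklist <OS-EXT-SRV-ATTR:instance_name> --details    # the disk type should be "network" with an rbd source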
Back-out plan: migrating VMs from Ceph to local storage
Shutdown the VM
cloudcontrol $ openstack server stop <UUID>
Convert the Ceph raw image back to QCOW2 on the local hypervisor
hypervisor $ qemu-img convert -f raw -O qcow2 rbd:compute/<UUID>_disk:id=eqiad1-compute /var/lib/nova/instances/<UUID>/disk
Power on the VM
cloudcontrol $ openstack server start <UUID>
Cloudvirt Drive Audit
- OpenStack total storage requested: 58.2 TB
- OpenStack total storage used: 27.72 TB
Potential SSDs for relocation:
- 57 x Raw Size: 1.455 TB (Dell hosts)
- 58 x Raw Size: 1.746 TB (Dell hosts)
- 36 x Raw Size: 1.6 TB (HP hosts)
27 replacement drives would need to be purchased to provide operating system drives for the hosts the SSDs were taken from.
Raw data
Some of the Dell hosts are not properly reporting the drive models with MegaCli.
cloudvirt[1001-1003].eqiad.wmnet
2 Size: 146 GB Rotational Speed: 15000 Model: HP EH0146FCBVB
16 Size: 300 GB Rotational Speed: 15000 Model: HP EH0300JDYTH
cloudvirt[1004-1005].eqiad.wmnet
1 Size: 146 GB Rotational Speed: 15000 Model: HP EH0146FBQDC
1 Size: 146 GB Rotational Speed: 15000 Model: HP EH0146FCBVB
16 Size: 300 GB Rotational Speed: 15000 Model: HP EH0300JDYTH
cloudvirt1006.eqiad.wmnet
2 Size: 146 GB Rotational Speed: 15000 Model: HP EH0146FCBVB
16 Size: 300 GB Rotational Speed: 15000 Model: HP EH0300JDYTH
cloudvirt1007.eqiad.wmnet
2 Size: 146 GB Rotational Speed: 15000 Model: HP EH0146FBQDC
1 Size: 300 GB Rotational Speed: 15000 Model: HP EH0300FCBVC
15 Size: 300 GB Rotational Speed: 15000 Model: HP EH0300JDYTH
cloudvirt[1008-1009].eqiad.wmnet
2 Size: 146 GB Rotational Speed: 15000 Model: HP EH0146FBQDC
16 Size: 300 GB Rotational Speed: 15000 Model: HP EH0300JDYTH
cloudvirt[1012-1014].eqiad.wmnet
6 Size: 1600.3 GB Model: ATA LK1600GEYMV
cloudvirt[1015-1017].eqiad.wmnet
10 Size: 1.455 TB Model: SSDSC2BX016T4R G201DL2D
cloudvirt1018.eqiad.wmnet
7 Size: 1.455 TB Model: SSDSC2BX016T4R G201DL2D
2 Size: 1.746 TB Model: SSDSC2KG019T8R XCV1DL63
1 Size: 1.746 TB Model: XCV1DL61
cloudvirt[1021-1022].eqiad.wmnet
10 Size: 1.455 TB Model: DAC9
cloudvirt1023.eqiad.wmnet
10 Size: 1.746 TB Model: SCV1DL58
cloudvirt1024.eqiad.wmnet
5 Size: 1.746 TB Model: SCV1DL58
1 Size: 1.746 TB Model: SSDSC2KB019T8R XCV1DL63
1 Size: 1.746 TB Model: XCV1DL61
cloudvirt[1025-1030].eqiad.wmnet
6 Size: 1.746 TB Model: HE56
CloudVPS Sizing Recommendations
When reviewing the sizing guides in the community keep in mind the types of drives and their capabilities.
Sizing recommendations using the results from our internal testing and community guidelines:
Type | Bus | Bandwidth (AVG) | CPU Recommendations | RAM Recommendations |
---|---|---|---|---|
HDD | SATA 6Gb/s | 150MB/s | Xeon Silver 2GHz + 1 cpu core = 2 core-GHz per HDD | 1GB for 1TB of storage |
SSD* | SATA 6Gb/s | 450MB/s | Xeon Silver 2GHz + 2 cpu cores = 4 core-GHz per SSD | 1GB for 1TB of storage |
NVMe | M.2 PCIe 32 Gb/s | 3000MB/s | Xeon Gold 2GHz * 5 cpu cores * 2 sockets = 20 core-GHz per NVMe | 2GB for each OSD |
* POC cluster is equipped with SATA SSD drives
In addition to the baseline CPU requirements, it's necessary to include additional CPU and RAM for the operating system and Ceph rebuilding, rebalancing and data scrubbing.
Next Phase Ceph OSD Server recommendation:
PowerEdge R440 Rack Server
- 2 x Xeon Silver 4214 CPU 12 cores / 24 threads
- 2 x 32GB RDIMM
- 2 x 240GB SSD SATA (OS Drive)
- 8 x 1.92TB SSD SATA (Data Drive)
- 2 x 10Gb NIC
- No RAID (JBOD Only)
An estimated 15 OSD servers would provide enough storage capacity for existing virtual machine disk images and block devices.
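A rough back-of-the-envelope check of that estimate, assuming 3x replication and the 8 x 1.92TB data drives listed above (numbers are approximate):
15 servers x 8 drives x 1.92 TB = ~230 TB raw
230 TB raw / 3 replicas         = ~76 TB usable
This sits above the ~58.2 TB currently requested in OpenStack, though rebalancing and free-space margins still need to be factored in.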
CLI examples
Create, format and mount an RBD image (useful for testing / debugging)
$ rbd create datatest --size 250 --pool compute --image-feature layering
$ rbd map datatest --pool compute --name client.admin
$ mkfs.ext4 -m0 /dev/rbd0
$ mount /dev/rbd0 /mnt/
$ umount /mnt
$ rbd unmap /dev/rbd0
$ rbd rm compute/datatest
List RBD nova images
$ rbd ls -p compute
9e2522ca-fd5e-4d42-b403-57afda7584c0_disk
Show RBD image information
$ rbd info -p compute 9051203e-b858-4ec9-acfd-44b9e5c0ecb1_disk
rbd image '9051203e-b858-4ec9-acfd-44b9e5c0ecb1_disk':
        size 20 GiB in 5120 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: aec56b8b4567
        block_name_prefix: rbd_data.aec56b8b4567
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        op_features:
        flags:
        create_timestamp: Mon Jan  6 21:36:11 2020
        access_timestamp: Mon Jan  6 21:36:11 2020
        modify_timestamp: Mon Jan  6 21:36:11 2020
View RBD image with qemu tools on a hypervisor
$ qemu-img info rbd:<pool>/<vm uuid>_disk:id=<ceph user>
Community best practice notes
- Jan 23, 2019 Cloud storage performance at CERN https://indico.cern.ch/event/755842/contributions/3243386/attachments/1784159/2904041/2019-jcollet-openlab.pdf
- May 22, 2018 How to survive an OpenStack Cloud Meltdown with Ceph http://people.redhat.com/~flucifre/talks/How%20to%20Survive%20an%20OpenStack%20Cloud%20Meltdown%20with%20Ceph%20-%20Vancouver%20Summit%202018.pdf
- Nov 27, 2016 The dos and donts for ceph on openstack https://ceph.io/planet/the-dos-and-donts-for-ceph-for-openstack/
- Aug 2, 2017 more recommendations for ceph and openstack https://ceph.io/planet/more-recommendations-for-ceph-and-openstack/
Resources
- Ceph releases https://docs.ceph.com/docs/master/releases/
- Debian Buster Ceph packages versions https://packages.debian.org/buster/ceph
- Ceph mailing list discussion on Debian (stretch) packaging http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027081.html
- Ceph block devices and openstack https://docs.ceph.com/docs/master/rbd/rbd-openstack/
- Ceph prometheus module https://docs.ceph.com/docs/master/mgr/prometheus/
- CephFS and Ganesha https://docs.ceph.com/docs/master/cephfs/nfs/
- Rook.io advanced Ceph configuration https://docs.ceph.com/docs/master/cephfs/nfs/