Getting Started With Ceph

Posted on Jul 9, 2024

Ceph Introduction

Ceph has evolved a lot since its birth in 2007, passing important milestones such as the Red Hat era and, later, the IBM era after IBM acquired Red Hat. Users and administrators may already know a lot about it, but these milestones have probably taken the software much further than it was originally expected to go.

Since the IBM era, Ceph aims to have one major release per year, keeping the last two releases supported and retiring the rest.

One of the biggest advantages of Ceph is that it is horizontally scalable; in short, it has a scale-out architecture. There are several other storage solutions with different orientations, such as MinIO, GlusterFS, OpenZFS, DRBD, Lustre and so on, each with its own pros and cons. Of course this list could be extended with proprietary software and appliances, but here we mostly touch on open-source software.

Since Ceph offers a wide range of storage services, it can be used in many different sectors and markets for a variety of purposes, such as a storage backend for virtualization and containerization platforms, databases, file sharing, backup and archiving, analytics, media and more.

Ceph Components

Introduction to RADOS

RADOS, the core component of Ceph, stands for Reliable Autonomic Distributed Object Store: a self-healing, self-managing collection of intelligent storage nodes.

Ceph Access Methods

There are three main components and one intermediary low-level component on top of RADOS.

  1. RBD - RADOS Block Device
  2. RGW - RADOS Gateway
  3. CephFS - POSIX Filesystem
  4. librados

The intermediary component, librados, can be accessed directly from applications and also handles the communication between RADOS and both RBD and RGW. CephFS talks to RADOS directly.

RADOS Block Device (RBD)

This component acts as a block device and can be used for any kind of disk storage purpose, such as virtual machine disks, database disks and so on, with the classical snapshot, replication and consistency capabilities.
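
As a rough illustration, here is a minimal sketch using the RBD Python bindings (python3-rbd together with python3-rados); the pool name rbd and the image name are placeholders for this example:

```python
import rados
import rbd

# Connect with the local ceph.conf and open an I/O context on a pool
# that has the rbd application enabled ("rbd" is a placeholder name).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

# Create a 1 GiB image and use it like a thin-provisioned block device.
rbd.RBD().create(ioctx, 'demo-image', 1024 ** 3)
with rbd.Image(ioctx, 'demo-image') as image:
    image.write(b'hello block world', 0)   # write at offset 0
    print(image.size())                    # image size in bytes

ioctx.close()
cluster.shutdown()
```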

CephFS (POSIX Filesystem)

The Ceph file system (CephFS) is a distributed and almost POSIX-compliant file system: more compliant than NFS and HDFS, but not as much as ext4 and XFS. The differences between CephFS and POSIX semantics are well described here.
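
For a quick impression of how an application can talk to CephFS without a kernel mount, here is a minimal sketch using the libcephfs Python bindings (python3-cephfs), assuming a CephFS file system already exists in the cluster:

```python
import cephfs

# Mount the CephFS file system through libcephfs in user space.
fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()

# Ordinary POSIX-style metadata operations on the distributed file system.
fs.mkdir('/demo', 0o755)
print(fs.stat('/demo'))

fs.unmount()
fs.shutdown()
```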

RADOS Gateway

RADOS Gateway (RGW) is the access point to the object store residing on RADOS, exposed through the S3 and/or Swift RESTful protocols. Both interfaces are largely compatible with their ancestors, the Amazon S3 API and the OpenStack Swift API. These protocols can be served over HTTP, HTTPS and FastCGI.

Object storage has many advantages over a file system in terms of scaling, analytics and the ability to be geographically distributed.
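
Since the S3 API is largely compatible, any standard S3 client can be pointed at RGW. Below is a minimal boto3 sketch; the endpoint URL and credentials are placeholders for an RGW user created beforehand (for example with radosgw-admin user create):

```python
import boto3

# Point a standard S3 client at the RADOS Gateway endpoint (placeholder values).
s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:8080',
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

s3.create_bucket(Bucket='demo-bucket')
s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'hello object world')
print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())
```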

librados

librados is a C library (with bindings for several other languages, including Python) that works directly with RADOS. It is also used as the interface between RADOS and both the Object Gateway and the Block Device.
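
The bindings make direct RADOS access quite compact. A minimal Python sketch, assuming an existing pool named demo-pool and a reachable admin keyring:

```python
import rados

# Connect to the cluster using the local ceph.conf and default keyring.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Read and write raw RADOS objects in an existing pool ("demo-pool" is a placeholder).
ioctx = cluster.open_ioctx('demo-pool')
ioctx.write_full('greeting', b'hello rados')
print(ioctx.read('greeting'))

ioctx.close()
cluster.shutdown()
```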

Ceph MON Daemon

MON is the abbreviation of Monitor. These nodes maintain the maps of the cluster state and can be considered the baseline for the other daemons’ communication. The cluster map is the collection of maps that contain the cluster state and configuration.

The cluster map is composed of the following maps.

  1. The Monitor Map
  2. The OSD Map
  3. The PG Map
  4. The CRUSH Map
  5. The MDS Map

To apply updates to the cluster, the MONs must reach consensus about the cluster state. At all times the MONs must form a quorum (50% + 1), which is why an odd number of them is recommended: it guarantees a clear majority when voting on the cluster state. Usually 3 or 5 MONs are enough for a Ceph cluster.
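
The current quorum can be inspected programmatically through the monitors themselves. A small sketch using the rados Python bindings (the CLI equivalent is ceph quorum_status):

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Ask the MONs which of them currently form the quorum and who leads it.
ret, out, err = cluster.mon_command(
    json.dumps({'prefix': 'quorum_status', 'format': 'json'}), b'')
status = json.loads(out)
print(status['quorum_names'], status['quorum_leader_name'])

cluster.shutdown()
```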

monmap

MON nodes read the monmap and ceph.conf to identify the other MONs they need to talk to in order to establish the quorum. That is why the monmap is only updated when a MON fails or is added or removed.
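
The monmap can be dumped the same way (the CLI equivalent is ceph mon dump); note the epoch, which only moves when the set of MONs changes:

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Dump the monmap: its epoch only increases when MONs are added, removed or replaced.
ret, out, err = cluster.mon_command(
    json.dumps({'prefix': 'mon dump', 'format': 'json'}), b'')
monmap = json.loads(out)
print(monmap['epoch'], [m['name'] for m in monmap['mons']])

cluster.shutdown()
```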

Ceph MGR Daemon

MGR is the abbreviation of Manager. This daemon is auxiliary to the MONs and was mainly created to offload some roles from them; it provides additional monitoring and communicates with external management and monitoring systems.

It has become a crucial component of a Ceph cluster, as it is required for some widely used monitoring commands like ceph osd df and ceph status.

It is important to understand that the Ceph MGR has no direct effect on I/O, but querying cluster statistics may fail while it is under high load or down. To avoid such a scenario, a redundant setup is recommended.
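
To see this dependency in practice, a command such as osd df can be sent to the active MGR with the rados Python bindings. This is only a sketch: it assumes a recent binding version that exposes mgr_command, and it will fail if no MGR is available:

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# "osd df" is served by the active MGR; with no MGR up this call fails or stalls.
ret, out, err = cluster.mgr_command(
    json.dumps({'prefix': 'osd df', 'format': 'json'}), b'')
for node in json.loads(out)['nodes']:
    print(node['name'], node['utilization'])

cluster.shutdown()
```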

Ceph OSD Daemon

OSD is the abbreviation of Object Storage Device. These are the building blocks of the Ceph cluster. An OSD attaches a storage device on the local system, which can be a local hard drive, an SSD or an external storage LUN, to the Ceph cluster. A single Ceph node can run multiple OSD daemons, meaning multiple disks from each storage node can be connected to the cluster. OSD operations are designed to be highly performant: using the CRUSH (Controlled Replication Under Scalable Hashing) algorithm, both Ceph clients and OSD daemons can compute an object’s location themselves instead of consulting an extra lookup-table layer. In recent versions of Ceph, the recommended way to store the object data is BlueStore instead of FileStore. A nice comparison can be found here.
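
The OSD layout of a cluster can be listed through the CRUSH hierarchy (the CLI equivalent is ceph osd tree); a small sketch with the rados Python bindings:

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Every node of type "osd" in the CRUSH tree is one OSD daemon backing one device.
ret, out, err = cluster.mon_command(
    json.dumps({'prefix': 'osd tree', 'format': 'json'}), b'')
for node in json.loads(out)['nodes']:
    if node['type'] == 'osd':
        print(node['name'], node['status'])

cluster.shutdown()
```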

CRUSH Map

CRUSH, which stands for Controlled Replication Under Scalable Hashing, computes data storage locations and thereby determines how data is written and read. With the help of CRUSH, Ceph clients can interact with the OSDs directly, avoiding a central lookup layer that would introduce a single point of failure and a performance bottleneck. CRUSH orchestrates the data flow and assigns every object to a Placement Group (PG). Long story short, PGs are the virtual layer between the application layer and the physical layer, that is, between objects and OSDs.
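
The object → PG → OSD calculation can also be asked for directly (the CLI equivalent is ceph osd map &lt;pool&gt; &lt;object&gt;); the pool and object names below are placeholders:

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Ask where CRUSH places an object: pool -> placement group -> acting set of OSDs.
ret, out, err = cluster.mon_command(
    json.dumps({'prefix': 'osd map', 'pool': 'demo-pool',
                'object': 'greeting', 'format': 'json'}), b'')
mapping = json.loads(out)
print(mapping['pgid'], mapping['acting'])

cluster.shutdown()
```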

MDS Daemon

MDS is the abbreviation of Metadata Server. To use CephFS, this daemon must be running, as it manages the file system metadata. It can run in an Active/Standby or Active/Active manner, depending on the size of the cluster and the volume of metadata operations. Metadata should always be stored on fast drives like SSDs or NVMe devices, separated from the data itself.
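
The MDS state (active and standby daemons) can be checked with a quick status call; the CLI equivalent is ceph mds stat:

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Summarise the MDS ranks, e.g. "cephfs:1 {0=mds.a=up:active} 1 up:standby".
ret, out, err = cluster.mon_command(json.dumps({'prefix': 'mds stat'}), b'')
# Some commands return their summary in the status string rather than the output buffer.
print(out.decode() or err)

cluster.shutdown()
```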

Ceph RGW Daemon

As mentioned above, this daemon provides the object gateway service, handling the object traffic and the API calls of the S3 and Swift protocols. It is always recommended to place a load balancer in front of the RADOS Gateway to balance the frontend load.