Ceph™ is an open-source software-defined storage (SDS) solution built on the idea of storing data as objects. Importantly, Ceph™ objects are not objects in the sense used in databases, programming languages or RGW (see Glossary). An object in Ceph™ consists of an object ID, binary data and metadata stored as key-value pairs. Significantly, the object ID is unique across the entire cluster.
Ceph™ itself implements two revolutionary ideas which require a brief introduction:
- CRUSH – Controlled Replication Under Scalable Hashing. This algorithm computes where a given object is placed (in Ceph™, on an OSD, that is, a “disk” [the role of the OSD may be played by a disk, an LVM volume, BlueStore or even an ordinary file]). Thanks to this approach, a client which knows the state (map) of the cluster can calculate independently where a particular object is stored. This avoids bottlenecks such as a lookup service/server.
- RADOS – Reliable Autonomic Distributed Object Store. This is a set of algorithms and cluster-wide mechanisms covering data access, data redundancy, failure detection, and cluster/data recovery. RADOS can operate on clusters containing thousands of devices and hosts.
Some of the most essential functions of RADOS include:
- use of CRUSH to determine the place (OSD) to store objects
- direct communication between the client and server (OSD) storing the object in question
- simultaneous independent operation on all “disks” (OSD)
- simultaneous independent access for many clients
- automatic data protection
- no metadata storage service/server
- no centralised lookup server/service
- automatic, autonomous (independent) reaction to changes in the state of the cluster, as well as to failures.
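The idea of client-side placement without a lookup server can be illustrated with a short sketch. This is not the CRUSH algorithm itself; it swaps in plain rendezvous (highest-random-weight) hashing as a much simpler stand-in, and the names `place`, `cluster_map` and the `osd.N` labels are assumptions made for illustration only. What it shows is the property described above: any client holding the same map computes the same location for an object.

```python
import hashlib


def _score(object_id: str, osd: str) -> int:
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{object_id}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_id: str, cluster_map: list[str], replicas: int = 3) -> list[str]:
    """Return the OSDs that should hold object_id, highest score first.

    Simplified stand-in for CRUSH: no hierarchy, weights or rules.
    """
    ranked = sorted(
        cluster_map,
        key=lambda osd: (_score(object_id, osd), osd),
        reverse=True,
    )
    return ranked[:replicas]


# Hypothetical cluster map with six OSDs.
cluster_map = [f"osd.{i}" for i in range(6)]

# Two independent "clients" with the same map (in a different order)
# arrive at the same placement without asking any central service.
client_a = place("invoice-2024-001", cluster_map)
client_b = place("invoice-2024-001", list(reversed(cluster_map)))
print(client_a == client_b)  # True
```

Because the result depends only on the object ID and the map, the only thing clients need to keep in sync is the map itself.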
A unique feature of Ceph™ is its ability to provide several different types of mass storage. Moreover, different types of mass storage can be provided simultaneously by a single cluster.
The first of these is a block device, that is, a device with random access (in the output of the command ls -l, block devices are marked with the letter b). Typical block devices include disks (both physical and virtual). The block devices provided by Ceph™ can be accessed through either the kernel driver or the FUSE driver. In Ceph™ terminology, a block device provided in this way is an RBD (RADOS Block Device).
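The type letter that ls -l prints can also be checked from a program. The following is a minimal sketch using only the Python standard library; the helper names `file_type_letter` and `is_block_device` are made up for this example.

```python
import os
import stat


def file_type_letter(path: str) -> str:
    """Return the first character that `ls -l` would print for path."""
    # lstat does not follow symbolic links, matching plain `ls -l`.
    mode = os.lstat(path).st_mode
    return stat.filemode(mode)[0]


def is_block_device(path: str) -> bool:
    """True if path (after following symlinks) is a block device."""
    return stat.S_ISBLK(os.stat(path).st_mode)


# The current directory is a directory, not a block device.
print(file_type_letter("."), is_block_device("."))  # d False
```

On a host where an RBD image has been mapped through the kernel driver, calling `is_block_device` on the resulting device node (assumed here to be something like /dev/rbd0) would be expected to return True.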
Another popular application for Ceph™ is to use it as mass storage for objects (simply put, files) accessed through the Swift protocol or a protocol compatible with Amazon™ S3. In this way, Ceph™ functions as a backend for storing objects shared via an HTTP service. This mode of use is called Ceph™ Object Gateway, RADOS Gateway, RADOSGW or RGW.
It’s worth noting a particular problem with the Amazon™ S3 protocol here. Although it is unquestionably the de facto standard among developers, it has no formal specification, that is, it has not been described by any RFC or similar document. In addition, its internal mechanisms are not publicly known. Ceph™, while mimicking the access methods and functions of S3, relies mainly on publicly known behaviour, which Amazon™ can freely modify.
The penultimate type of access to Ceph™ is its use as a mounted, shared file system. This method is called simply CephFS (Ceph™ File System). A cluster which is meant to serve a file system in this way must run an additional service responsible for storing file metadata. It is worth remembering that CephFS differs slightly from the POSIX standard.
The last way to use a Ceph™ cluster is through the librados library, which gives applications direct access to RADOS. This is the least used method of employing Ceph™ as a distributed, scalable, high-performance object store, because support for it must be written directly into the application. Nonetheless, it is also the method with the potential to achieve the highest performance. RBD and RGW, among others, are themselves built on librados.
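To give a feel for what writing RADOS access directly into an application involves, here is a minimal sketch using the Python binding for librados (the `rados` module). It assumes the binding is installed, that a cluster is reachable, that the configuration file is at /etc/ceph/ceph.conf, and that the pool passed in already exists. The function name `roundtrip`, the attribute name `owner` and its value are hypothetical.

```python
def roundtrip(pool: str, object_id: str, payload: bytes,
              conffile: str = "/etc/ceph/ceph.conf") -> tuple[bytes, bytes]:
    """Write an object with one metadata attribute, then read both back."""
    # Imported here so the module loads even where the binding is absent.
    import rados

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        # An I/O context is bound to a single pool.
        ioctx = cluster.open_ioctx(pool)
        try:
            ioctx.write_full(object_id, payload)           # binary data
            ioctx.set_xattr(object_id, "owner", b"demo")   # key-value metadata
            data = ioctx.read(object_id)
            owner = ioctx.get_xattr(object_id, "owner")
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
    return data, owner
```

A call such as `roundtrip("mypool", "greeting", b"hello")` would store and fetch a small object in a pool named mypool, which is an assumed name. Note how the object ID, binary data and key-value metadata map directly onto the calls.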
Finally, it should be stressed that a Ceph™ cluster can support all four types of access at the same time. This is possible thanks to pools: generally speaking, each application has its own pool.