Recently my company created an application for managing 3D printing projects, profiles, and slices. Check it out at layerkeep.com.
We wanted users to be able to keep track of all their file revisions and also be able to manage the files without having to go through the browser. To accomplish this, we decided to use Git, which meant we needed a scalable filesystem.
The first thing we did was set up a Kubernetes cluster on DigitalOcean.
Currently DigitalOcean only provides Volumes that are ReadWriteOnce. Since we have multiple services that need access to the files (api, nginx, slicers), we needed to be able to mount the same volume with ReadWriteMany.
I decided to try s3fs with DigitalOcean Spaces, since Spaces is an S3-compatible object store. I set up the CSI driver from https://github.com/ctrox/csi-s3 and tried both the s3fs and goofys mounters. Both worked, and both were way too slow. Most of our APIs access the filesystem multiple times per request, and each access took between 3 and 15 seconds, so I moved on to Ceph.
Ceph Preparation:
There is a great storage manager called Rook (https://rook.github.io/) that can be used to deploy many different storage providers to Kubernetes.
** Kubernetes on DigitalOcean doesn’t support FlexVolumes, so you need to use CSI instead.
Hardware Requirements.
You can check the ceph docs to see what you might need. http://docs.ceph.com/docs/jewel/start/hardware-recommendations/#minimum-hardware-recommendations
Create the Kubernetes Cluster
Follow the directions here to create the cluster: https://www.digitalocean.com/docs/kubernetes/how-to/create-clusters/
** Initially I tried a 3-node pool with 1 CPU and 2 GB of memory per node, but it wasn’t enough; Ceph needed more CPU on startup. I changed each node to 2 CPUs and 2 GB of memory, which worked.
We’ll keep all Ceph services constrained to this pool by naming it “storage-pool” (or whatever name you want) and adding a node affinity for that name later.
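If you prefer the CLI, here’s a sketch using doctl (the cluster name, region, and node size are just examples; the node-pool name is what matters for the affinity later):
doctl kubernetes cluster create layerkeep-cluster \
  --region nyc1 \
  --node-pool "name=storage-pool;size=s-2vcpu-2gb;count=3"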
Cluster Access
Make sure you follow DigitalOcean’s directions for connecting to the cluster with kubectl. (https://www.digitalocean.com/docs/kubernetes/how-to/connect-to-cluster/)
You also might want to add a Kubernetes Dashboard. (https://github.com/kubernetes/dashboard)
SSH:
Right now it doesn’t look like you can SSH into the droplets that DigitalOcean creates for a node pool. I wanted access just in case, so I went to the Droplets section and reset the root password for each of them. I was then able to add my SSH key and disable root login. I recommend doing this before adding any services.
Create Volumes
Go to the Volumes section in the DigitalOcean dashboard. Create a volume for each node in the node pool we just created (don’t format them), then attach each one to the correct droplet. Remember that volumes can only be increased in size, not decreased; shrinking means creating a new volume.
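This can also be scripted with doctl (a sketch; the volume name, region, and size are examples, and the volume and droplet IDs come from the list commands):
doctl compute droplet list        # note the droplet IDs of your storage-pool nodes
doctl compute volume create ceph-osd-0 --region nyc1 --size 100GiB
doctl compute volume list         # note the new volume's ID
doctl compute volume-action attach <volume-id> <droplet-id>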
Create the Ceph Cluster
Clone the Rook repository, or just copy the ceph directory from: https://github.com/rook/rook/tree/release-1.0/cluster/examples/kubernetes/ceph
cd cluster/examples/kubernetes/ceph
Modify the cluster.yaml file.
This is where we’ll add the node affinity to run the ceph cluster only on nodes with the “storage-pool” name.
placement:
  all:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: doks.digitalocean.com/node-pool
            operator: In
            values:
            - storage-pool
    podAffinity:
    podAntiAffinity:
    tolerations:
    - key: storage-pool
      operator: Exists
There are also other configs, commented out, that you might need to change. For example, if your disks are smaller than 100 GB, you’ll need to uncomment 'databaseSizeMB: "1024"'.
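For reference, that setting lives under storage.config in cluster.yaml; here is a sketch based on the Rook 1.0 example file (surrounding keys omitted, and your values may differ):
storage:
  config:
    databaseSizeMB: "1024"  # uncomment when the disks are smaller than 100 GB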
Modify the filesystem.yaml file if you want. (Filesystem Design)
Once you’re done configuring you can run:
kubectl apply -f ceph/common.yaml
kubectl apply -f ceph/csi/rbac/cephfs/
kubectl apply -f ceph/filesystem.yaml
kubectl apply -f ceph/operator-with-csi.yaml
kubectl apply -f ceph/cluster.yaml
If you want the ceph dashboard you can run:
kubectl apply -f ceph/dashboard-external-https.yaml
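To log in, the default username is admin, and Rook stores the generated password in a secret (the secret name below is taken from the Rook dashboard docs; double-check it for your release):
kubectl -n rook-ceph get secret rook-ceph-dashboard-password \
  -o jsonpath="{['data']['password']}" | base64 --decode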
The operator should now create your cluster. You should see 3 managers, 3 monitors, and 3 OSDs. Check here if you run into issues: https://rook.github.io/docs/rook/master/ceph-common-issues.html
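A quick way to check progress is to list the pods in the rook-ceph namespace and wait until the mon, mgr, and osd pods are all Running:
kubectl get pods -n rook-ceph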
Deploy the CSI
https://rook.github.io/docs/rook/master/ceph-csi-drivers.html
We need to create a secret to give the provisioner permission to create the volumes.
To get the adminKey we need to exec into the operator pod. We can print it out in one line with:
POD_NAME=$(kubectl get pods -n rook-ceph | grep rook-ceph-operator | awk '{print $1;}'); kubectl exec -it $POD_NAME -n rook-ceph -- ceph auth get-key client.admin
Create a secret.yaml file:
apiVersion: v1
kind: Secret
metadata:
  name: csi-cephfs-secret
  namespace: default
# stringData accepts plain-text values; with data: they would have to be base64-encoded
stringData:
  # Required if provisionVolume is set to true
  adminID: admin
  adminKey: {{ PUT THE RESULT FROM LAST COMMAND }}
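Alternatively, you can skip the YAML and let kubectl build the secret directly from the command output (a sketch, reusing $POD_NAME from above and assuming the same secret name and namespace):
ADMIN_KEY=$(kubectl exec $POD_NAME -n rook-ceph -- ceph auth get-key client.admin)
kubectl create secret generic csi-cephfs-secret -n default \
  --from-literal=adminID=admin \
  --from-literal=adminKey="$ADMIN_KEY"
If you go with the secret.yaml file instead, remember to apply it with kubectl apply -f secret.yaml.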
Create the CephFS StorageClass.
We’ll need to modify the example storageclass in ceph/csi/example/cephfs/storageclass.yaml.
The storageclass.yaml file should look like:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-cephfs
provisioner: cephfs.csi.ceph.com
parameters:
  # Comma-separated list of Ceph monitors.
  # If using FQDNs, make sure the csi plugin's DNS policy is appropriate.
  monitors: rook-ceph-mon-a.rook-ceph:6789,rook-ceph-mon-b.rook-ceph:6789,rook-ceph-mon-c.rook-ceph:6789
  # For provisionVolume: "true":
  #   A new volume will be created along with a new Ceph user.
  #   Requires admin credentials (adminID, adminKey).
  # For provisionVolume: "false":
  #   It is assumed the volume already exists and the user is expected
  #   to provide the path to that volume (rootPath) and user credentials (userID, userKey).
  provisionVolume: "true"
  # Ceph pool into which the volume shall be created.
  # Required for provisionVolume: "true".
  pool: myfs-data0
  # The secrets have to contain user and/or Ceph admin credentials.
  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/node-stage-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/node-stage-secret-namespace: default
reclaimPolicy: Retain
allowVolumeExpansion: true
Change the storage class name to whatever you want.
*If you changed metadata.name in filesystem.yaml to something other than “myfs”, make sure you update the pool name here.
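Once you’re happy with it, apply the storage class and confirm it shows up (assuming you edited the example file in place):
kubectl apply -f ceph/csi/example/cephfs/storageclass.yaml
kubectl get storageclass csi-cephfs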
Create the PVC:
Remember that Persistent Volume Claims are accessible only from within the same namespace.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: my-pv-claim
spec:
  storageClassName: csi-cephfs
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
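Save this as pvc.yaml (the filename is just an example), apply it, and check that the claim binds:
kubectl apply -f pvc.yaml
kubectl get pvc my-pv-claim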
Use the Storage
Now you can mount the volume in your Kubernetes resources using the persistent volume claim you just created. An example Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webserver
  namespace: default
  labels:
    k8s-app: webserver
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: webserver
  template:
    metadata:
      labels:
        k8s-app: webserver
    spec:
      containers:
      - name: web-server
        image: nginx
        volumeMounts:
        - name: my-persistent-storage
          mountPath: /var/www/assets
      volumes:
      - name: my-persistent-storage
        persistentVolumeClaim:
          claimName: my-pv-claim
Both deployment replicas will have access to the same data inside /var/www/assets.
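A quick sanity check (a sketch; the label selector and mount path match the example Deployment above) is to write a file from one replica and read it from the other:
POD_A=$(kubectl get pods -l k8s-app=webserver -o jsonpath='{.items[0].metadata.name}')
POD_B=$(kubectl get pods -l k8s-app=webserver -o jsonpath='{.items[1].metadata.name}')
kubectl exec $POD_A -- sh -c 'echo "hello from A" > /var/www/assets/test.txt'
kubectl exec $POD_B -- cat /var/www/assets/test.txt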
Additional Tools
You can also test and debug the filesystem using the Rook toolbox. (https://rook.io/docs/rook/v1.0/ceph-toolbox.html).
First, start the toolbox with:
kubectl apply -f ceph/toolbox.yaml
Shell into the pod.
TOOL_POD=$(kubectl get pods -n rook-ceph | grep tools | head -n 1 | awk '{print $1;}'); kubectl exec -it $TOOL_POD -n rook-ceph -- /bin/bash
Run Ceph commands: http://docs.ceph.com/docs/giant/rados/operations/control/
Validate the filesystem is working by mounting it directly into the toolbox pod.
From: https://rook.io/docs/rook/v1.0/direct-tools.html
# Create the directory
mkdir /tmp/registry
# Detect the mon endpoints and the user secret for the connection
mon_endpoints=$(grep mon_host /etc/ceph/ceph.conf | awk '{print $3}')
my_secret=$(grep key /etc/ceph/keyring | awk '{print $3}')
# Mount the file system
mount -t ceph -o mds_namespace=myfs,name=admin,secret=$my_secret $mon_endpoints:/ /tmp/registry
# See your mounted file system
df -h
Try writing and reading a file to the shared file system.
echo "Hello Rook" > /tmp/registry/hello
cat /tmp/registry/hello
# delete the file when you're done
rm -f /tmp/registry/hello
Unmount the Filesystem
To unmount the shared file system from the toolbox pod:
umount /tmp/registry
rmdir /tmp/registry
No data will be deleted by unmounting the file system.
Monitoring
Now that everything is working, you should add monitoring and alerts.
You can add the Ceph dashboard and/or Prometheus/Grafana to monitor your filesystem.
http://docs.ceph.com/docs/master/mgr/dashboard/
https://github.com/rook/rook/blob/master/Documentation/ceph-monitoring.md
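As a rough sketch (the file names come from the Rook monitoring doc linked above and assume the Prometheus Operator is already installed in your cluster; check the doc for your Rook version):
cd cluster/examples/kubernetes/ceph/monitoring
kubectl create -f service-monitor.yaml
kubectl create -f prometheus.yaml
kubectl create -f prometheus-service.yaml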