16 March 2016

Here’s my checklist for Ceph performance tuning. It can be used for deployment or for performance troubleshooting. I borrowed from the great framework posted by RaySun.

Hardware Layer

About the server nodes

  • Choose proper CPU and memory (e.g. frequency, size) for the different Ceph node roles, such as OSD, MON, and MDS. Erasure coding needs more CPU resources than replication.
  • Enable HT and VT in the BIOS. Optionally turn off NUMA and power-saving modes.
  • Number of nodes
  • How many disks for each storage node

About the storage

  • Choose proper storage media, for example disk rotation rate, disk interface (SAS, SATA), or SSD, weighing cost per GB against throughput and latency.
  • Use SSDs for journals.
  • Choose between JBOD (recommended) or RAID, and between local disks (recommended) or SAN.
  • Decide whether to use an HBA/RAID card, and which one.
  • Decide whether the RAID card should use write-through or write-back caching, and whether it has a battery or capacitor to protect the cache.

About the network

  • NIC count and bandwidth for each type of Ceph node.
  • Enable jumbo frames if your switches support them (MTU 9000 instead of 1500); see the sketch after this list.
  • The internal cluster network should provide at least 10 Gb of bandwidth.
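
A minimal sketch for enabling jumbo frames on the cluster-facing NIC. The interface name eth1 and the peer address 10.0.0.2 are placeholders, and every switch port on that network must also be set to MTU 9000:

```
# raise the MTU on the cluster-facing interface (hypothetical name eth1)
ip link set dev eth1 mtu 9000

# verify end to end: 8972 = 9000 - 20 (IP header) - 8 (ICMP header); -M do forbids fragmentation
ping -M do -s 8972 -c 3 10.0.0.2
```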

OS Layer

  • Enable NTP time synchronization; Ceph is sensitive to clock skew.
  • It is recommended to put the OS, OSD data, and journal each on a different disk, or at least on a different partition.
  • Make pid max and the open-file limit large enough.
  • Set vm.swappiness to zero.
  • Enable (and enlarge) kernel read_ahead.
  • Set the kernel block IO scheduler: noop for SSDs, deadline for SATA/SAS disks. Increase the block IO queue size. A sketch of these OS-level settings follows this list.
  • Turn off the disk controller cache if it has no battery/capacitor to protect data from a power outage.
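
A minimal sketch of these OS-level settings, assuming an OSD data disk named /dev/sdb; the values are illustrative examples, not tuned recommendations:

```
# kernel limits and swappiness (persist them in /etc/sysctl.conf)
sysctl -w kernel.pid_max=4194303
sysctl -w fs.file-max=6553600
sysctl -w vm.swappiness=0

# raise the open-file limit for the ceph user (e.g. in /etc/security/limits.conf)
echo "ceph  -  nofile  131072" >> /etc/security/limits.conf

# read-ahead, scheduler, and queue size for the OSD disk (hypothetical device sdb)
echo 4096     > /sys/block/sdb/queue/read_ahead_kb
echo deadline > /sys/block/sdb/queue/scheduler    # use noop for SSDs
echo 1024     > /sys/block/sdb/queue/nr_requests
```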

Block Caching Layer

  • Use bcache or LVM cache (dm-cache) to put an SSD in front of slow OSD disks; a minimal lvmcache sketch follows.
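
A minimal lvmcache sketch, assuming a slow OSD disk /dev/sdb and an SSD /dev/nvme0n1 used as its cache (device names and sizes are hypothetical):

```
# put the slow disk and the SSD into one volume group
pvcreate /dev/sdb /dev/nvme0n1
vgcreate vg_osd /dev/sdb /dev/nvme0n1

# data LV on the slow disk, cache pool on the SSD
lvcreate -n osd_data -l 100%PVS vg_osd /dev/sdb
lvcreate --type cache-pool -n osd_cache -l 90%PVS vg_osd /dev/nvme0n1

# attach the cache pool to the data LV (writethrough by default; writeback is faster but riskier)
lvconvert --type cache --cachepool vg_osd/osd_cache vg_osd/osd_data
```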

Filesystem Layer

Ceph Layer

  • One OSD per disk. Run monitors on separate nodes.
  • Put the journal on a separate disk from the OSD data if you can.
  • Use cgroups to pin each OSD to its CPU cores/socket to avoid NUMA issues; a pinning sketch follows this list.
  • Choose a proper PG count. Briefly, PGs = round_to_power_of_2((Total_number_of_OSDs * 100) / max_replication_count). See pgcalc, and the calculation sketch after this list.
  • Scrubbing, if enabled, may severely impact performance.
  • Use tcmalloc and enlarge its max thread cache; see hustcat’s blog.
  • Choose between erasure coding and replication.
  • Enlarge almost everything in the Ceph config: max open files, buffer sizes, flush intervals, … a lot. See RaySun’s blog.
  • Increase redundant parallel reads with erasure coding. Throttle recovery (see the throttling sketch after this list). Enable bucket index sharding. See Yahoo’s blog.
  • An OSD requires about 1 GB of memory per 1 TB of storage.
  • Tune the CRUSH map to improve reliability by reducing the number of copysets. See UStack’s blog and this paper.
  • Enable RBD cache. See link
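
A back-of-the-envelope version of the PG formula above, as a shell sketch (the OSD and replica counts are made-up examples; use pgcalc for real sizing):

```
osds=40         # total number of OSDs (example value)
replicas=3      # max replication count (example value)

raw=$(( osds * 100 / replicas ))

# round up to the next power of two
pgs=1
while [ "$pgs" -lt "$raw" ]; do pgs=$(( pgs * 2 )); done

echo "pg_num = $pgs"   # 40 * 100 / 3 = 1333 -> 2048
```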
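
A hedged sketch of the recovery throttling, scrub control, and RBD cache items above; the option names are standard Ceph options, but the values are only examples:

```
# throttle backfill/recovery so client IO keeps priority (runtime injection into all OSDs)
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'

# temporarily disable scrubbing during peak hours (remember to unset these flags later)
ceph osd set noscrub
ceph osd set nodeep-scrub

# client-side RBD cache, set in ceph.conf on the client hosts:
#   [client]
#   rbd cache = true
#   rbd cache writethrough until flush = true
```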
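
A minimal cpuset pinning sketch using the libcgroup tools, assuming osd.0 should stay on cores 0-5 of NUMA node 0 (the core range, node number, and pid lookup are illustrative):

```
# create a cpuset cgroup for osd.0 and bind it to cores 0-5 on NUMA node 0
cgcreate -g cpuset:/osd0
cgset -r cpuset.cpus=0-5 osd0
cgset -r cpuset.mems=0 osd0

# move the running OSD daemon into the cgroup (adjust the pattern to your init system)
cgclassify -g cpuset:/osd0 $(pgrep -f 'ceph-osd.*-i 0')
```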

Benchmarking Tools

Troubleshooting Cases

  • Rebalancing, while in progress, may severely impact performance.
  • If a disk is broken or has deteriorated, the performance of the whole cluster may be severely impacted (see the commands after this list).
  • If the snapshot chain grows too long, operations on the image may become slow.
  • A RAID card failure can cause a large drop in IOPS; see this blog.
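
A few commands that help spot the cases above; ceph -s, ceph osd perf, and iostat are standard tools, and the device name sdb is a placeholder:

```
# is the cluster rebalancing or recovering right now?
ceph -s

# per-OSD commit/apply latency; a single outlier often points at a failing or deteriorated disk
ceph osd perf

# confirm on the suspect host with device-level utilization and await times
iostat -x sdb 2
```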

References


