Ceph Performance Tuning Checklist

16 March 2016

Here’s my Ceph performance tuning checklist. It can be used for deployment planning or performance troubleshooting. I borrowed from the great framework posted by RaySun.

Hardware Layer

About the server nodes

Choose proper CPU and memory (e.g. frequency, size) for the different Ceph node roles, such as OSD, MON, and MDS. Erasure coding needs more CPU resources.
Enable HT and VT in BIOS. Optionally shut off NUMA and power-saving.
Number of nodes
How many disks for each storage node

About the storage

Choose proper storage, for example disk rotation rate, disk interface (SAS, SATA), or SSD, with respect to cost/GB vs throughput vs latency.
Use SSDs for the journal.
Choose between using JBOD (recommended) or RAID, local disk (recommended) or SAN.
Whether to use an HBA or a RAID card, and which one.
Whether the RAID card uses write-through or write-back, and whether it has a battery or capacitor.

About the network

NIC count and bandwidth for the different types of Ceph nodes.
Enable jumbo frames if your switch supports them (MTU 9000 instead of 1500).
The bandwidth of the internal cluster network should be no less than 10 Gb/s.
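The network items above can be checked from Linux sysfs. The following is a minimal sketch, assuming the NIC exposes the standard `/sys/class/net/<iface>/mtu` and `/sys/class/net/<iface>/speed` files (speed is reported in Mb/s). The function names and the interface name `eth0` used below are illustrative, not part of any Ceph tool.

```python
from pathlib import Path

JUMBO_MTU = 9000          # MTU named in the checklist
MIN_CLUSTER_MBPS = 10000  # 10 Gb/s expressed in Mb/s, the unit sysfs uses


def read_sysfs_int(iface, name, sys_net=Path("/sys/class/net")):
    """Read one integer attribute (e.g. 'mtu') of a network interface."""
    return int((Path(sys_net) / iface / name).read_text().strip())


def nic_problems(iface, sys_net=Path("/sys/class/net")):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    mtu = read_sysfs_int(iface, "mtu", sys_net)
    if mtu < JUMBO_MTU:
        problems.append(f"{iface}: MTU is {mtu}, expected at least {JUMBO_MTU}")
    speed = read_sysfs_int(iface, "speed", sys_net)
    if speed < MIN_CLUSTER_MBPS:
        problems.append(
            f"{iface}: link speed is {speed} Mb/s, expected at least {MIN_CLUSTER_MBPS}"
        )
    return problems
```

Calling `nic_problems("eth0")` on a storage node lists what still needs fixing. Reading `speed` can fail on interfaces that are down or virtual, so treat this as a starting point rather than a complete check.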

OS Layer

Enable NTP time synchronization. Ceph is sensitive to time.
It is recommended to put the OS, OSD data, and journal each on a different disk, or at least on a different partition.
Make pid_max and the open-file limit large enough.
Set vm.swappiness to zero.
Enable kernel read_ahead.
Set the kernel block IO scheduler: noop for SSD, deadline for SATA/SAS disks. Increase the block IO queue size.
Shut off the disk controller cache, because it doesn’t have a battery/capacitor to protect against power outage.
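The OS items above map onto a few sysctl keys and per-disk sysfs files. Here is a rough sketch that only computes the values to apply and writes nothing. The numeric defaults (`nr_requests`, `read_ahead_kb`, `file_max`) are placeholder assumptions to tune per workload, and the device name in the usage note is hypothetical. On newer multi-queue kernels the scheduler names differ (for example `none` and `mq-deadline`), so the names here follow the checklist as written.

```python
from pathlib import Path
from typing import Dict, List


def is_rotational(dev: str, sys_block: Path = Path("/sys/block")) -> bool:
    """True for spinning disks, False for SSDs, as reported by the kernel."""
    flag = (Path(sys_block) / dev / "queue" / "rotational").read_text().strip()
    return flag == "1"


def queue_settings(dev: str, rotational: bool,
                   nr_requests: int = 1024,
                   read_ahead_kb: int = 4096) -> Dict[str, str]:
    """Map sysfs paths to the values suggested by the checklist."""
    base = f"/sys/block/{dev}/queue"
    return {
        f"{base}/scheduler": "deadline" if rotational else "noop",
        f"{base}/nr_requests": str(nr_requests),      # block IO queue size
        f"{base}/read_ahead_kb": str(read_ahead_kb),  # kernel read-ahead
    }


def sysctl_lines(pid_max: int = 4194304, file_max: int = 2097152) -> List[str]:
    """Lines suitable for a sysctl configuration file."""
    return [
        "vm.swappiness = 0",
        f"kernel.pid_max = {pid_max}",
        f"fs.file-max = {file_max}",
    ]
```

For a hypothetical disk `sdb`, `queue_settings("sdb", is_rotational("sdb"))` gives the paths and values to write as root, and `sysctl_lines()` gives the lines to place in a sysctl configuration file.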

Block Caching Layer

Use bcache, or LVM cache.

Filesystem Layer

FS type: XFS, BTRFS, or EXT4 (XFS is recommended; BTRFS is good but not production ready)
FS block size, and inode size, inode count. Beware of file count vs average file size.
FS parameters: set noatime, nobarrier
Larger FS journal size
If SSD, add discard/trim to FS parameter
Ensure that all file inodes, file name descriptors, and metadata info are cached in memory. See link
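The mount parameters above can be assembled into an fstab entry. This is a small sketch, assuming XFS and the options named in the checklist; the device and mount point in the usage note are examples only. Newer kernels may reject `nobarrier` for XFS, so verify the options against the kernel in use.

```python
def mount_options(ssd: bool) -> str:
    """Build the mount option string: noatime, nobarrier, plus discard on SSD."""
    opts = ["noatime", "nobarrier"]
    if ssd:
        opts.append("discard")  # enables online trim on SSDs
    return ",".join(opts)


def fstab_line(device: str, mountpoint: str, ssd: bool, fstype: str = "xfs") -> str:
    """Format one /etc/fstab entry (dump and fsck pass fields set to 0)."""
    return f"{device} {mountpoint} {fstype} {mount_options(ssd)} 0 0"
```

For example, `fstab_line("/dev/sdb1", "/var/lib/ceph/osd/ceph-0", ssd=False)` produces an entry for a spinning OSD data disk.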

Ceph Layer

One OSD per disk. Run monitors on separate nodes.
Put the journal on a disk separate from the OSD data disk if you can.
Use cgroups to pin each OSD to its own CPU core/socket (to avoid NUMA issues).
Proper PG count. Briefly, PGs = round2((Total_number_of_OSD * 100) / max_replication_count), where round2 rounds the result to a power of two. See pgcalc.
Scrubbing, if enabled, may severely impact performance.
Enable tcmalloc and adjust max thread cache, see hustcat’s blog.
Choose to use erasure code or replica.
Enlarge almost everything in Ceph config: max open files, buffer sizes, flush intervals, … a lot. See RaySun’s blog.
Increase redundant parallel reads with erasure coding. Recovery throttling. Enable bucket sharding. See Yahoo’s post.
An OSD requires about 1 GB of memory per 1 TB of storage.
CRUSH map configurations to improve reliability by reducing the number of copysets. See UStack’s blog and this paper.
Enable RBD Cache. See link
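The PG formula and the memory rule of thumb above are easy to turn into a quick calculator. The sketch below assumes that round2 means rounding up to the next power of two; pgcalc applies its own, more detailed rounding rules, so use it for real sizing. The function names are illustrative.

```python
def round_up_to_power_of_two(value: float) -> int:
    """Smallest power of two that is greater than or equal to value."""
    power = 1
    while power < value:
        power *= 2
    return power


def pg_count(num_osds: int, max_replication: int, pgs_per_osd: int = 100) -> int:
    """PGs = round2((num_osds * 100) / max_replication), rounding up (assumption)."""
    return round_up_to_power_of_two(num_osds * pgs_per_osd / max_replication)


def node_osd_memory_gb(disks: int, tb_per_disk: float, gb_per_tb: float = 1.0) -> float:
    """Memory for all OSDs on one node at about 1 GB per 1 TB of storage."""
    return disks * tb_per_disk * gb_per_tb
```

For example, `pg_count(40, 3)` gives 2048, because 40 * 100 / 3 is about 1333 and the next power of two is 2048. A node with 12 disks of 4 TB each needs about `node_osd_memory_gb(12, 4)` = 48 GB for its OSDs.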

Benchmarking Tools

Ceph perf counter, which is embedded in code
Benchmark commands: rados bench, iperf, dd, fio, cbt, ceph osd perf. See Ceph wiki.
Tracking commands: top, iowait, iostat, blktrace, debugfs.
Watch for “slow xxx” in ceph’s log.
Project CeTune, the Ceph profiling and tuning framework.
Linux Performance Analysis in 60,000 Milliseconds and Netflix at Velocity 2015: Linux Performance Tools
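Watching for “slow xxx” messages can be automated with a few lines. This is a minimal sketch that counts such phrases in an iterable of log lines; it makes no assumption about the exact Ceph log format beyond the word "slow" followed by another word, and the log path in the usage note is only an example.

```python
import re
from collections import Counter

# Matches "slow" followed by one word, e.g. "slow request" or "slow requests".
SLOW_PATTERN = re.compile(r"\bslow (\w+)", re.IGNORECASE)


def count_slow_events(lines):
    """Count each distinct 'slow <word>' phrase found in the given lines."""
    counts = Counter()
    for line in lines:
        match = SLOW_PATTERN.search(line)
        if match:
            counts["slow " + match.group(1).lower()] += 1
    return counts
```

A typical use is `count_slow_events(open("/var/log/ceph/ceph.log"))`, which returns a counter keyed by phrase. A count that grows during a benchmark run is a hint to look at the affected OSDs.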

Troubleshooting Cases

Rebalancing, if in progress, may severely impact performance.
If a disk is broken or degraded, the performance of the whole cluster may be severely impacted.
If the snapshot chain is too long, it may become slow.
A RAID card failure results in a great IOPS decrease; see this blog.

References

Ceph性能优化总结(v0.94) (Summary of Ceph Performance Optimization, v0.94)
One Ceph, Two ways of thinking
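几个 Ceph 性能优化的新方法和思路（2015 SH Ceph Day 参后感） (Several New Methods and Ideas for Ceph Performance Optimization: Thoughts After Attending 2015 SH Ceph Day)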
Scheduler queue size and resilience to heavy IO
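Ceph性能调优——Journal与tcmalloc (Ceph Performance Tuning: Journal and tcmalloc)
打造高性能高可靠块存储系统 (Building a High-Performance, Highly Reliable Block Storage System)
linux系统数据落盘之细节 (Details of How Data Reaches Disk on Linux Systems)
海量小文件存储与Ceph实践 (Massive Small-File Storage and Ceph in Practice)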
Linux Performance Analysis in 60,000 Milliseconds
[Netflix at Velocity 2015: Linux Performance Tools](http://techblog.netflix.com/2015/08/netflix-at-velocity-2015-linux.html)
EXT4 File-System Tuning Benchmarks
xfs文件系统使用总结 (Summary of Using the XFS Filesystem)
[Ceph: Open Source Storage Software Optimizations on Intel® Architecture for Cloud Workloads](http://www.slideshare.net/LarryCover/ceph-open-source-storage-software-optimizations-on-intel-architecture-for-cloud-workloads)
Yahoo Cloud Object Store - Object Storage at Exabyte Scale
BENCHMARK CEPH CLUSTER PERFORMANCE
Ceph Benchmarks
优化的重点：针对Ceph的七剑 (Key Points of Optimization: Seven Swords for Ceph)
Ceph OSD Hardware - A Pragmatic Guide
