Ceph Performance Tuning Checklist
16 March 2016
Here’s my checklist for Ceph performance tuning. It can be used for deployment or performance troubleshooting. I borrowed from the great framework posted by RaySun.
Hardware Layer
About the server nodes
- Choose proper CPU and memory (e.g. frequency, size, etc.) for the different types of Ceph nodes, such as OSD, MON, and MDS. Erasure coding needs more CPU.
- Enable HT and VT in BIOS. Optionally shut off NUMA and power-saving.
- Number of nodes
- How many disks for each storage node
About the storage
- Choose proper storage media, e.g. disk rotation rate, disk interface (SAS, SATA), or SSD, with respect to cost/GB vs. throughput vs. latency.
- Use SSD for journal
- Choose between using JBOD (recommended) or RAID, local disk (recommended) or SAN.
- Decide whether to use an HBA or RAID card, and which one.
- If using a RAID card, choose between write-through and write-back, and check whether it has a battery or capacitor.
About the network
- NIC count and bandwidth for the different types of Ceph nodes.
- Enable jumbo frames if your switches support them (MTU 9000 instead of 1500); see the sketch after this list.
- The bandwidth of the internal cluster network should be no less than 10Gb.
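For jumbo frames, here’s a minimal sketch, assuming the cluster-network interface is eth0 (adjust for your NIC naming) and that every switch port on the path is configured for MTU 9000:

```sh
# Raise the MTU on the cluster-network interface (eth0 is an assumption);
# persist it in your distro's network config so it survives reboots.
ip link set dev eth0 mtu 9000

# Verify jumbo frames end-to-end: 8972 = 9000 - 20 (IP) - 8 (ICMP) bytes,
# and -M do forbids fragmentation. <peer-cluster-ip> is a placeholder.
ping -M do -s 8972 <peer-cluster-ip>
```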
OS Layer
- Enable NTP time synchronization. Ceph is sensitive to time.
- It is recommended to put the OS, OSD data, and journal each on a separate disk, or at least on separate partitions.
- Make pid max and the open-file limit large enough.
- Set vm.swappiness to zero.
- Enable kernel read_ahead.
- Set the kernel block IO scheduler: noop for SSDs, deadline for SATA/SAS disks. Increase the block IO queue size. A sketch of these settings follows this list.
- Shut off the disk controller cache, since it has no battery/capacitor to protect it from a power outage.
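A rough sketch of the OS-level settings above; the sysctl values and the device name sdb are examples, not recommendations, so tune them for your own hardware:

```sh
# Example sysctl values only -- adjust for your workload and node size.
cat > /etc/sysctl.d/90-ceph-tuning.conf <<'EOF'
kernel.pid_max = 4194303
fs.file-max = 6553600
vm.swappiness = 0
EOF
sysctl --system

# Also raise the per-process open-file limit (nofile) for the ceph user,
# e.g. via /etc/security/limits.conf or the systemd unit.

# Per-device block-layer settings; /dev/sdb is a hypothetical OSD data disk.
echo deadline > /sys/block/sdb/queue/scheduler    # use noop for SSDs
echo 1024 > /sys/block/sdb/queue/nr_requests      # deeper block IO queue
echo 4096 > /sys/block/sdb/queue/read_ahead_kb    # larger read-ahead
```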
Block Caching Layer
- Use bcache or LVM cache (dm-cache) to cache HDDs on SSDs; a minimal lvmcache sketch follows.
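For the LVM cache option, a minimal lvmcache sketch, assuming a volume group named ceph with a slow origin LV osd0 on HDD and a spare SSD /dev/sdc already added as a PV (all names and sizes are hypothetical):

```sh
# Create the cache-data and cache-metadata LVs on the SSD.
lvcreate -L 100G -n osd0_cache ceph /dev/sdc
lvcreate -L 1G -n osd0_cachemeta ceph /dev/sdc

# Combine them into a cache pool, then attach the pool to the slow LV.
lvconvert --type cache-pool --poolmetadata ceph/osd0_cachemeta ceph/osd0_cache
lvconvert --type cache --cachepool ceph/osd0_cache ceph/osd0
```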
Filesystem Layer
- FS type: XFS, Btrfs, or ext4 (XFS is recommended; Btrfs looks good but is not production-ready).
- FS block size, inode size, and inode count. Beware of file count vs. average file size.
- FS mount parameters: set noatime, nobarrier (a mkfs/mount sketch follows this list).
- Larger FS journal size.
- If on SSD, add discard/TRIM to the mount options.
- Ensure that inodes, dentries, and other filesystem metadata stay cached in memory. See link.
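A hedged mkfs/mount sketch for an XFS-backed OSD; /dev/sdb1 and the mount point are hypothetical, and the larger inode/log sizes are common suggestions rather than hard rules:

```sh
# Larger inodes leave room for Ceph's xattrs; a bigger log helps
# metadata-heavy workloads. /dev/sdb1 is a hypothetical OSD partition.
mkfs.xfs -f -i size=2048 -l size=1024m /dev/sdb1

# noatime/nodiratime always; nobarrier only with a battery/capacitor-backed
# write cache; add discard only on SSDs (or run fstrim periodically instead).
mount -o noatime,nodiratime,nobarrier,inode64 /dev/sdb1 /var/lib/ceph/osd/ceph-0
```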
Ceph Layer
- One OSD per disk. Run monitors on separate nodes.
- Put the journal on a separate disk from the OSD data if you can.
- Use cgroups to pin each OSD to its CPU cores/socket (to avoid NUMA issues).
- Proper PG count. Briefly, PGs = (Total_number_of_OSD * 100) / max_replication_count, rounded up to the nearest power of two. See pgcalc and the sketch after this list.
- Scrubbing, if enabled, may severely impact performance.
- Enable tcmalloc and adjust max thread cache, see hustcat’s blog.
- Choose between erasure coding and replication.
- Enlarge almost everything in the Ceph config: max open files, buffer sizes, flush intervals, … a lot. See RaySun’s blog and the ceph.conf sketch after this list.
- With erasure coding, increase redundant parallel reads. Throttle recovery. Enable bucket index sharding. See Yahoo’s post.
- An OSD requires about 1 GB of memory per 1 TB of storage.
- CRUSH map configurations to improve reliability by reducing number of copysets. See UStack’s blog and this paper.
- Enable RBD Cache. See link
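To make the PG formula and the config enlargement concrete, here’s a sketch; the pool name, PG count, and every value in the ceph.conf fragment are illustrative (Hammer-era option names), so check your release’s documentation before copying anything:

```sh
# PG count example: 40 OSDs, 3 replicas -> 40 * 100 / 3 ~= 1333,
# rounded up to the next power of two = 2048 (see pgcalc for multi-pool setups).
ceph osd pool create rbdpool 2048 2048

# Illustrative ceph.conf fragment -- example values, not recommendations;
# merge into your existing sections rather than blindly appending.
cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
max open files = 131072

[osd]
osd op threads = 8
filestore max sync interval = 10
filestore queue max ops = 500

[client]
rbd cache = true
rbd cache writethrough until flush = true
EOF
```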
Benchmarking Tools
- Ceph perf counters, which are embedded in the code.
- Benchmark commands: rados bench, iperf, dd, fio, cbt, ceph osd perf. See the Ceph wiki; example invocations follow this list.
- Tracking commands: top, iowait, iostat, blktrace, debugfs.
- Watch for “slow request” and similar “slow …” warnings in Ceph’s logs.
- Project CeTune, the Ceph profiling and tuning framework.
- Linux Performance Analysis in 60,000 Milliseconds and Netflix at Velocity 2015: Linux Performance Tools
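A few example invocations for the tools above; pool names, device paths, and runtimes are placeholders:

```sh
# Raw network throughput between two nodes (run `iperf -s` on the peer first).
iperf -c <peer-ip> -P 4

# Object-level write then sequential-read benchmark against a test pool.
rados bench -p testpool 60 write --no-cleanup
rados bench -p testpool 60 seq
rados -p testpool cleanup        # remove the benchmark objects afterwards

# 4K random-write fio run against a test file or RBD-mapped device.
fio --name=randwrite --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio \
    --direct=1 --runtime=60 --size=10G --filename=/path/to/testfile

# Per-OSD commit/apply latency as seen by the cluster.
ceph osd perf
```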
Troubleshooting Cases
- Rebalancing, if in progress, may severely impact performance; see the throttling sketch after this list.
- If a disk is broken or deteriorating, the performance of the whole cluster may be severely impacted.
- If the snapshot chain is too long, IO may become slow.
- RAID card failure results in great IOPS decrease, see this blog.
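When rebalancing or recovery is hurting client IO, one hedged mitigation is to throttle it at runtime (conservative example values; revert once the cluster is healthy again):

```sh
# Throttle backfill/recovery so client IO keeps priority.
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

# Watch recovery progress, slow requests, and per-OSD health.
ceph -s
ceph health detail
```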
References
- Ceph Performance Optimization Summary (v0.94)
- One Ceph, Two ways of thinking
- Several New Methods and Ideas for Ceph Performance Optimization (Reflections on the 2015 Shanghai Ceph Day)
- Scheduler queue size and resilience to heavy IO
- Ceph Performance Tuning: Journal and tcmalloc
- Building a High-Performance, Highly Reliable Block Storage System
- Details of How Data Is Flushed to Disk on Linux
- Massive Small-File Storage and Ceph Practice
- Linux Performance Analysis in 60,000 Milliseconds
- [Netflix at Velocity 2015: Linux Performance Tools](http://techblog.netflix.com/2015/08/netflix-at-velocity-2015-linux.html)
- EXT4 File-System Tuning Benchmarks
- XFS Filesystem Usage Summary
- [Ceph: Open Source Storage Software Optimizations on Intel® Architecture for Cloud Workloads](http://www.slideshare.net/LarryCover/ceph-open-source-storage-software-optimizations-on-intel-architecture-for-cloud-workloads)
- Yahoo Cloud Object Store - Object Storage at Exabyte Scale
- BENCHMARK CEPH CLUSTER PERFORMANCE
- Ceph Benchmarks
- Key Optimization Points: Seven Swords for Ceph
- Ceph OSD Hardware - A Pragmatic Guide