ceph recommendation - hardware

Introduction:
Recommended hardware configurations for the main Ceph components: mon, osd, and mds.

1. CPU recommendations
MDS nodes handle metadata computation and load redistribution and are fairly CPU-intensive; quad-core or better CPUs are recommended.
OSD nodes run the RADOS service, compute data placement (placement groups) with the CRUSH algorithm, and handle replication, recovery, and maintenance of their copy of the cluster map (the OSD map), so they also rely on the CPU; dual-core or better CPUs are recommended.
MON nodes maintain the cluster map and are not CPU-intensive; a single core is enough.

2. RAM recommendations
MDS and MON nodes need at least 1 GB of RAM per daemon (e.g. a host running several mds daemons needs correspondingly more GB).
For OSD nodes, allocate roughly 1 GB of RAM per TB of storage on the node (e.g. a node with 10 TB of storage should have about 10 GB of RAM).
In general, more memory is better.
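As a quick illustration of the sizing rules above (the 1 GB per daemon and 1 GB per TB figures come from the text; the helper itself is just a sketch):

```python
def recommended_ram_gb(mon_daemons=0, mds_daemons=0, osd_storage_tb=0):
    """Rough sizing: 1 GB per mon/mds daemon plus 1 GB per TB of OSD storage."""
    return mon_daemons + mds_daemons + osd_storage_tb

# A node with 1 mon, 1 mds and 10 TB of OSD storage -> about 12 GB of RAM.
print(recommended_ram_gb(mon_daemons=1, mds_daemons=1, osd_storage_tb=10))
```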

3. Storage recommendations
Since Ceph has to write all data to the journal before it can send an ACK (for XFS and ext4 at least), keeping journal and OSD performance in balance is really important.
There are also file system limitations to consider: btrfs is not quite stable enough for production, but it can write journal data and object data simultaneously, whereas XFS and ext4 cannot and must write to the journal first.

OSD
Keep the OS disk and the OSD disks physically separate, give each osd its own dedicated disk (if one host runs several osd daemons, assign each daemon a different physical disk), and put the osd journal and osd data on different physical disks.
For example, a host running 2 osd daemons should have at least 5 physical disks: 1 for the OS, 2 for the osd journals, and 2 for the osd data.
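The same arithmetic as a tiny sketch (the one-OS-disk plus one-journal-disk-and-one-data-disk-per-daemon layout comes from the recommendation above; the helper itself is illustrative):

```python
def min_physical_disks(osd_daemons, os_disks=1):
    """1 OS disk plus a separate journal disk and data disk for each osd daemon."""
    return os_disks + 2 * osd_daemons

print(min_physical_disks(2))  # 5 disks for a host running 2 osd daemons
```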

SSD
When evaluating SSDs, it is important to consider the performance of sequential reads and writes. An SSD that has 400MB/s sequential write throughput may have much better performance than an SSD with 120MB/s of sequential write throughput when storing multiple journals for multiple OSDs.
The journal is latency-sensitive for writes and directly affects Ceph's write performance, so it is recommended to place the journal on a fast SSD.
Approach:
Mount the journal directory (e.g. /var/lib/ceph/osd/$cluster-$id/journal) on an SSD partition.
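To double-check the result, here is a minimal Python sketch (the /var/lib/ceph/osd/ceph-0 path is only an example of the default layout mentioned above) that verifies the journal directory and the OSD data directory sit on different block devices:

```python
import os

def same_device(path_a, path_b):
    """True if both paths live on the same filesystem/block device (compares st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

# Example paths only: the default data dir for osd.0 and its journal directory.
# This check is meaningful for the "mount the journal dir on an SSD partition"
# approach described above (it does not inspect raw journal block devices).
osd_data = "/var/lib/ceph/osd/ceph-0"
journal  = os.path.join(osd_data, "journal")

if same_device(osd_data, journal):
    print("journal and data share a device - consider moving the journal to an SSD")
else:
    print("journal and data are on separate devices")
```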
Also note that consumer-grade SSDs may deliver worse write performance than enterprise spinning disks. Besides IOPS, also consider throughput (e.g. sequential write speed); PCI-E SSDs are recommended (faster interfaces, such as DIMM-attached SSDs, may appear in the future).
SSD partition alignment also needs to be considered; see reference:

MDS
To improve CephFS performance, store metadata and content in separate osd pools.
Reference:

Controllers
If you use shared storage, or local disks behind a RAID controller, make sure the controller does not become a bottleneck; controllers typically bottleneck on throughput.

Other recommendations
1. If a single host runs multiple osd daemons:
Increase the kernel parameter kernel.pid_max, e.g. kernel.pid_max=4194303 (a quick check is sketched after this list).
Make sure the glibc and kernel versions meet Ceph's recommendations.
2. Match network bandwidth to aggregate disk bandwidth: if a node runs many OSD daemons and the total disk bandwidth exceeds the network bandwidth, the network becomes the throughput bottleneck. In that case increase the network bandwidth, e.g. by bonding NICs or using NICs with a higher link speed (see the network recommendations below).
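For item 1, a minimal sketch for a Linux host (the script is illustrative; only the 4194303 figure comes from the recommendation above) that reads /proc/sys/kernel/pid_max and reports whether it is below the recommended value:

```python
#!/usr/bin/env python3
"""Check whether kernel.pid_max meets the recommendation for multi-OSD hosts."""

RECOMMENDED_PID_MAX = 4194303

def current_pid_max(path="/proc/sys/kernel/pid_max"):
    # /proc/sys/kernel/pid_max holds the current sysctl value as plain text.
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    value = current_pid_max()
    if value < RECOMMENDED_PID_MAX:
        # To raise it persistently, set kernel.pid_max in /etc/sysctl.conf
        # and reload with `sysctl -p` (as root).
        print(f"kernel.pid_max = {value}, below recommended {RECOMMENDED_PID_MAX}")
    else:
        print(f"kernel.pid_max = {value}, OK")
```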
 
4. Network recommendations
A Ceph cluster serves reads and writes to external clients. For reliability, data is stored as multiple replicas, and for scalability the cluster also rebalances data. Functionally, therefore, at least two networks should be provided: a public one (client reads/writes to Ceph, monitoring) and an internal one (Ceph's own replication, rebalancing, replica recovery on failover, and so on).
So how much bandwidth is needed?
1. The public network determines the bandwidth (throughput) available to clients. If a host runs many OSDs and the aggregate disk bandwidth exceeds the network bandwidth, the network becomes the throughput bottleneck; in that case increase the network bandwidth, e.g. by bonding NICs or using NICs with a higher link speed.
2. The internal (cluster) network carries replication traffic. For example, with 3 replicas configured, user data is stored 3 times in the cluster, and those copies travel over the internal network. In addition, when some OSDs go down, the cluster enters a degraded state and generates a large amount of copy traffic to bring the data that lived on the failed OSDs back to full replication (active+clean), so OSD failures produce heavy internal traffic. If a failed OSD held 1 TB of data, copying it over a 1 Gb/s NIC takes on the order of 8,000-10,000 seconds, while a 10 Gb/s NIC cuts that by a factor of ten (see the worked calculation below).
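The recovery-time estimate above as a small worked calculation (the 1 TB size and the 1 Gb/s / 10 Gb/s link speeds are just the example figures from the text; the function is illustrative):

```python
def copy_seconds(data_tb, link_gbps):
    """Seconds needed to move data_tb terabytes over a link_gbps Gb/s link at line rate."""
    bits = data_tb * 8e12            # 1 TB (decimal) = 8 * 10^12 bits
    return bits / (link_gbps * 1e9)  # ignores protocol overhead: a rough estimate

# Example from the text: a failed OSD holding 1 TB of data.
print(copy_seconds(1, 1))   # ~8000 s over a 1 Gb/s NIC
print(copy_seconds(1, 10))  # ~800 s over a 10 Gb/s NIC
```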

So as a general rule, use two separate networks, each with ample bandwidth: at least 10 Gb/s is recommended. If possible, add a third network for management.

5. Failure domain recommendations
In Ceph, a failure domain is any domain whose failure can take OSDs down, e.g. a network outage, an OS crash, or a crashed daemon.
Failure domains should be kept as small as possible.
A failure domain is any failure that prevents access to one or more OSDs. That could be a stopped daemon on a host; a hard disk failure, an OS crash, a malfunctioning NIC, a failed power supply, a network outage, a power outage, and so forth. When planning out your hardware needs, you must balance the temptation to reduce costs by placing too many responsibilities into too few failure domains, and the added costs of isolating every potential failure domain.

6. Minimum configuration recommendations
Minimum recommended hardware per process:

ceph-osd
  Processor:
    • 1x 64-bit AMD-64
    • 1x 32-bit ARM dual-core or better
    • 1x i386 dual-core
  RAM: ~1GB for 1TB of storage per daemon
  Volume Storage: 1x storage drive per daemon
  Journal: 1x SSD partition per daemon (optional)
  Network: 2x 1GB Ethernet NICs

ceph-mon
  Processor:
    • 1x 64-bit AMD-64/i386
    • 1x 32-bit ARM dual-core or better
    • 1x i386 dual-core
  RAM: 1 GB per daemon
  Disk Space: 10 GB per daemon
  Network: 2x 1GB Ethernet NICs

ceph-mds
  Processor:
    • 1x 64-bit AMD-64 quad-core
    • 1x 32-bit ARM quad-core
    • 1x i386 quad-core
  RAM: 1 GB minimum per daemon
  Disk Space: 1 MB per daemon
  Network: 2x 1GB Ethernet NICs

7. Production cluster example

A recent (2012) Ceph cluster project is using two fairly robust hardware configurations for Ceph OSDs, and a lighter configuration for monitors.

Minimum recommended configurations:

Dell PE R510 (OSD node)
  Processor: 2x 64-bit quad-core Xeon CPUs
  RAM: 16 GB
  Volume Storage: 8x 2TB drives (1 OS, 7 storage)
  Client Network: 2x 1GB Ethernet NICs
  OSD Network: 2x 1GB Ethernet NICs
  Mgmt. Network: 2x 1GB Ethernet NICs

Dell PE R515 (OSD node)
  Processor: 1x hex-core Opteron CPU
  RAM: 16 GB
  Volume Storage: 12x 3TB drives (storage)
  OS Storage: 1x 500GB drive (operating system)
  Client Network: 2x 1GB Ethernet NICs
  OSD Network: 2x 1GB Ethernet NICs
  Mgmt. Network: 2x 1GB Ethernet NICs


[References]