ceph recommendation - hardware

Introduction:
Recommended hardware configurations for the main Ceph components: mon, osd, and mds.

1. CPU recommendations
MDS nodes handle metadata computation and load redistribution and are fairly CPU-intensive; quad-core or better CPUs are recommended.
OSD nodes run the RADOS service, compute data placement (placement groups) with the CRUSH algorithm, and handle replication, recovery, and maintenance of their copy of the cluster map (the OSD map), so they also rely on the CPU; dual-core or better CPUs are recommended.
MON nodes maintain the cluster map and are not CPU-intensive; a single core is enough.

2. RAM recommendations
MDS and MON nodes need at least 1 GB of RAM per daemon (e.g. a host running several mds daemons needs correspondingly more GB).
For OSD nodes, allocate roughly 1 GB of RAM per TB of storage on the node (e.g. a node with 10 TB of storage should have about 10 GB of RAM).
In general, more memory is better.
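As a quick illustration of the sizing rules above (the 1 GB per daemon and 1 GB per TB figures come from the text; the helper itself is just a sketch):

```python
def recommended_ram_gb(mon_daemons=0, mds_daemons=0, osd_storage_tb=0):
    """Rough sizing: 1 GB per mon/mds daemon plus 1 GB per TB of OSD storage."""
    return mon_daemons + mds_daemons + osd_storage_tb

# A node with 1 mon, 1 mds and 10 TB of OSD storage -> about 12 GB of RAM.
print(recommended_ram_gb(mon_daemons=1, mds_daemons=1, osd_storage_tb=10))
```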

3. Storage recommendations
Since Ceph has to write all data to the journal before it can send an ACK (for XFS and ext4 at least), keeping journal and OSD performance in balance is really important.
There are also file system limitations to consider: btrfs is not quite stable enough for production, but it can write journal data and object data simultaneously, whereas XFS and ext4 cannot and must write to the journal first.

OSD
Keep the OS disk and the OSD disks physically separate, give each osd its own dedicated disk (if one host runs several osd daemons, assign each daemon a different physical disk), and put the osd journal and osd data on different physical disks.
For example, a host running 2 osd daemons should have at least 5 physical disks: 1 for the OS, 2 for the osd journals, and 2 for the osd data.
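The same arithmetic as a tiny sketch (the one-OS-disk plus one-journal-disk-and-one-data-disk-per-daemon layout comes from the recommendation above; the helper itself is illustrative):

```python
def min_physical_disks(osd_daemons, os_disks=1):
    """1 OS disk plus a separate journal disk and data disk for each osd daemon."""
    return os_disks + 2 * osd_daemons

print(min_physical_disks(2))  # 5 disks for a host running 2 osd daemons
```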

SSD
When evaluating SSDs, it is important to consider the performance of sequential reads and writes. An SSD that has 400MB/s sequential write throughput may have much better performance than an SSD with 120MB/s of sequential write throughput when storing multiple journals for multiple OSDs.
The journal is latency-sensitive for writes and directly affects Ceph's write performance, so it is recommended to place the journal on a fast SSD.
Approach:
Mount the journal directory (e.g. /var/lib/ceph/osd/$cluster-$id/journal) on an SSD partition.
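To double-check the result, here is a minimal Python sketch (the /var/lib/ceph/osd/ceph-0 path is only an example of the default layout mentioned above) that verifies the journal directory and the OSD data directory sit on different block devices:

```python
import os

def same_device(path_a, path_b):
    """True if both paths live on the same filesystem/block device (compares st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

# Example paths only: the default data dir for osd.0 and its journal directory.
# This check is meaningful for the "mount the journal dir on an SSD partition"
# approach described above (it does not inspect raw journal block devices).
osd_data = "/var/lib/ceph/osd/ceph-0"
journal  = os.path.join(osd_data, "journal")

if same_device(osd_data, journal):
    print("journal and data share a device - consider moving the journal to an SSD")
else:
    print("journal and data are on separate devices")
```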
Also note that consumer-grade SSDs may deliver worse write performance than enterprise spinning disks. Besides IOPS, also consider throughput (e.g. sequential write speed); PCI-E SSDs are recommended (faster interfaces, such as DIMM-attached SSDs, may appear in the future).
SSD partition alignment also needs to be considered; see reference:

MDS
To improve CephFS performance, store metadata and content in separate osd pools.
Reference:

Controllers
If you use shared storage, or local disks behind a RAID controller, make sure the controller does not become a bottleneck; controllers typically bottleneck on throughput.

Other recommendations
1. If a single host runs multiple osd daemons:
Increase the kernel parameter kernel.pid_max, e.g. kernel.pid_max=4194303 (a quick check is sketched after this list).
Make sure the glibc and kernel versions meet Ceph's recommendations.
2. Match network bandwidth to aggregate disk bandwidth: if a node runs many OSD daemons and the total disk bandwidth exceeds the network bandwidth, the network becomes the throughput bottleneck. In that case increase the network bandwidth, e.g. by bonding NICs or using NICs with a higher link speed (see the network recommendations below).
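For item 1, a minimal sketch for a Linux host (the script is illustrative; only the 4194303 figure comes from the recommendation above) that reads /proc/sys/kernel/pid_max and reports whether it is below the recommended value:

```python
#!/usr/bin/env python3
"""Check whether kernel.pid_max meets the recommendation for multi-OSD hosts."""

RECOMMENDED_PID_MAX = 4194303

def current_pid_max(path="/proc/sys/kernel/pid_max"):
    # /proc/sys/kernel/pid_max holds the current sysctl value as plain text.
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    value = current_pid_max()
    if value < RECOMMENDED_PID_MAX:
        # To raise it persistently, set kernel.pid_max in /etc/sysctl.conf
        # and reload with `sysctl -p` (as root).
        print(f"kernel.pid_max = {value}, below recommended {RECOMMENDED_PID_MAX}")
    else:
        print(f"kernel.pid_max = {value}, OK")
```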
 
4. Network recommendations
A Ceph cluster serves reads and writes to external clients. For reliability, data is stored as multiple replicas, and for scalability the cluster also rebalances data. Functionally, therefore, at least two networks should be provided: a public one (client reads/writes to Ceph, monitoring) and an internal one (Ceph's own replication, rebalancing, replica recovery on failover, and so on).
So how much bandwidth is needed?
1. The public network determines the bandwidth (throughput) available to clients. If a host runs many OSDs and the aggregate disk bandwidth exceeds the network bandwidth, the network becomes the throughput bottleneck; in that case increase the network bandwidth, e.g. by bonding NICs or using NICs with a higher link speed.
2. The internal (cluster) network carries replication traffic. For example, with 3 replicas configured, user data is stored 3 times in the cluster, and those copies travel over the internal network. In addition, when some OSDs go down, the cluster enters a degraded state and generates a large amount of copy traffic to bring the data that lived on the failed OSDs back to full replication (active+clean), so OSD failures produce heavy internal traffic. If a failed OSD held 1 TB of data, copying it over a 1 Gb/s NIC takes on the order of 8,000-10,000 seconds, while a 10 Gb/s NIC cuts that by a factor of ten (see the worked calculation below).
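The recovery-time estimate above as a small worked calculation (the 1 TB size and the 1 Gb/s / 10 Gb/s link speeds are just the example figures from the text; the function is illustrative):

```python
def copy_seconds(data_tb, link_gbps):
    """Seconds needed to move data_tb terabytes over a link_gbps Gb/s link at line rate."""
    bits = data_tb * 8e12            # 1 TB (decimal) = 8 * 10^12 bits
    return bits / (link_gbps * 1e9)  # ignores protocol overhead: a rough estimate

# Example from the text: a failed OSD holding 1 TB of data.
print(copy_seconds(1, 1))   # ~8000 s over a 1 Gb/s NIC
print(copy_seconds(1, 10))  # ~800 s over a 10 Gb/s NIC
```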

So as a general rule, use two separate networks, each with ample bandwidth: at least 10 Gb/s is recommended. If possible, add a third network for management.

5. Failure domain recommendations
In Ceph, a failure domain is any domain whose failure can take OSDs down, e.g. a network outage, an OS crash, or a crashed daemon.
Failure domains should be kept as small as possible.
A failure domain is any failure that prevents access to one or more OSDs. That could be a stopped daemon on a host; a hard disk failure, an OS crash, a malfunctioning NIC, a failed power supply, a network outage, a power outage, and so forth. When planning out your hardware needs, you must balance the temptation to reduce costs by placing too many responsibilities into too few failure domains, and the added costs of isolating every potential failure domain.

6. Minimum configuration recommendations
Minimum recommended hardware per process:

ceph-osd
  Processor:
    • 1x 64-bit AMD-64
    • 1x 32-bit ARM dual-core or better
    • 1x i386 dual-core
  RAM: ~1GB for 1TB of storage per daemon
  Volume Storage: 1x storage drive per daemon
  Journal: 1x SSD partition per daemon (optional)
  Network: 2x 1GB Ethernet NICs

ceph-mon
  Processor:
    • 1x 64-bit AMD-64/i386
    • 1x 32-bit ARM dual-core or better
    • 1x i386 dual-core
  RAM: 1 GB per daemon
  Disk Space: 10 GB per daemon
  Network: 2x 1GB Ethernet NICs

ceph-mds
  Processor:
    • 1x 64-bit AMD-64 quad-core
    • 1x 32-bit ARM quad-core
    • 1x i386 quad-core
  RAM: 1 GB minimum per daemon
  Disk Space: 1 MB per daemon
  Network: 2x 1GB Ethernet NICs

7. Production cluster example

A recent (2012) Ceph cluster project is using two fairly robust hardware configurations for Ceph OSDs, and a lighter configuration for monitors.

Minimum recommended configurations:

Dell PE R510 (OSD node)
  Processor: 2x 64-bit quad-core Xeon CPUs
  RAM: 16 GB
  Volume Storage: 8x 2TB drives (1 OS, 7 storage)
  Client Network: 2x 1GB Ethernet NICs
  OSD Network: 2x 1GB Ethernet NICs
  Mgmt. Network: 2x 1GB Ethernet NICs

Dell PE R515 (OSD node)
  Processor: 1x hex-core Opteron CPU
  RAM: 16 GB
  Volume Storage: 12x 3TB drives (storage)
  OS Storage: 1x 500GB drive (operating system)
  Client Network: 2x 1GB Ethernet NICs
  OSD Network: 2x 1GB Ethernet NICs
  Mgmt. Network: 2x 1GB Ethernet NICs


[References]