Reliability Verification of a Two-Node HA Cluster Built with Pacemaker + Corosync

In the previous article, a 3-node cluster was configured to make sure the shared resources would not be corrupted. This article verifies what risks exist when only two nodes are used.

As described in the Pacemaker documentation, Pacemaker supports two ways of preventing split-brain: quorum voting and resource preemption. Quorum voting is indeed reliable, but it requires at least 3 votes. Setting up a dedicated third machine beyond the active/standby pair just to satisfy the quorum requirement seems wasteful. That is why some commercial cluster products (such as MSCS and RHCS) introduce a quorum disk in addition to the nodes. The quorum disk also counts as one vote; together with the two nodes that makes 3 votes in total, and a partition only needs 2 of them to keep quorum.

Reference:
http://www.adirectory.blog.com/2013/01/cluster-quorum-disk/


Using a quorum disk requires shared storage. For many enterprise users shared storage is standard equipment, so it needs no extra investment.
The open-source Pacemaker, however, does not depend on shared storage, so there is no notion of a quorum disk there. What about configuring a preemption resource on Pacemaker, then? We could not find an existing configuration example, but one could imagine doing it like this:
Provide a file server (NFS, for example) and create a file on it to serve as a lock file. Then write a custom RA (it is not clear whether a ready-made RA of this kind exists) whose start action opens the file for writing, and whose monitor action writes a value into the file and reads it back, as sketched below. Make this RA a dependency of the other resources; if split-brain occurs, whichever node grabs this resource becomes the active node.
The problem with this approach is that the resource itself becomes a single point of failure: if the file server goes down, the HA cluster cannot come up either.
(Of course, if the HA workload naturally has a resource that only one node can ever hold, everyone is happy.)
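For illustration only, here is a minimal, untested sketch of what such a lock-file RA could look like. The agent itself, the lock-file path /mnt/nfs/ha.lock and the state file are made up for this example; it assumes the NFS export is already mounted on both nodes and that the OCF shell functions from resource-agents are installed:

#!/bin/sh
# Hypothetical lock-file RA sketch, for illustration only (not tested, not for production).
# Assumes an NFS export is already mounted at /mnt/nfs on both nodes.
: ${OCF_ROOT:=/usr/lib/ocf}
. ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

LOCKFILE="/mnt/nfs/ha.lock"                  # assumed lock file on the NFS server
STATEFILE="${HA_RSCTMP}/lockfile-ra.state"   # local marker: "started on this node"

case "$1" in
start)
    # "Grab" the lock by opening the file for writing.
    echo "`uname -n` `date +%s`" > "$LOCKFILE" || exit $OCF_ERR_GENERIC
    touch "$STATEFILE"
    exit $OCF_SUCCESS
    ;;
stop)
    rm -f "$STATEFILE"
    exit $OCF_SUCCESS
    ;;
monitor)
    [ -f "$STATEFILE" ] || exit $OCF_NOT_RUNNING
    # Write a value into the file and read it back; an I/O failure
    # (e.g. the file server is unreachable) is treated as losing the lock.
    echo "`uname -n` `date +%s`" > "$LOCKFILE" && grep -q "`uname -n`" "$LOCKFILE" \
        && exit $OCF_SUCCESS
    exit $OCF_ERR_GENERIC
    ;;
meta-data)
    # A real agent must print its OCF metadata XML here; omitted in this sketch.
    exit $OCF_SUCCESS
    ;;
validate-all)
    exit $OCF_SUCCESS
    ;;
*)
    exit $OCF_ERR_UNIMPLEMENTED
    ;;
esac

Note that this only mirrors the idea described above: a plain write to an NFS file does not by itself enforce exclusive access, so a real implementation would need a genuine exclusive lock on the file, in addition to relying on Pacemaker to start the primitive on only one node.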


Below we verify how Pacemaker behaves when there is neither quorum voting nor a preemption resource. When split-brain occurs, can a shared resource end up loaded on both nodes at the same time?

1. Environment

The environment from the previous article (http://blog.chinaunix.net/uid-20726500-id-4453488.html) is reused, but changed to a two-node cluster.


Shared storage server
OS: CentOS release 6.5 (Final)
Hostname: disknode
NIC 1:
  Reserved
  NIC type: NAT
  IP: 192.168.152.120
NIC 2:
  Used for iSCSI traffic to the shared disk
  NIC type: Host-Only
  IP: 192.168.146.120
NIC 3:
  Used for communication with the external/ssh fencing device
  NIC type: Bridged
  IP: 10.167.217.107


HA node 1
OS: CentOS release 6.5 (Final)
Hostname: hanode1
NIC 1:
  Used for the cluster public IP (192.168.152.200) and intra-cluster messaging
  NIC type: NAT
  IP: 192.168.152.130
NIC 2:
  Used for iSCSI traffic to the shared disk
  NIC type: Host-Only
  IP: 192.168.146.130
NIC 3:
  Used for communication with the external/ssh fencing device
  NIC type: Bridged
  IP: 10.167.217.169


HA node 2
OS: CentOS release 6.5 (Final)
Hostname: hanode2
NIC 1:
  Used for the cluster public IP (192.168.152.200) and intra-cluster messaging
  NIC type: NAT
  IP: 192.168.152.140
NIC 2:
  Used for iSCSI traffic to the shared disk
  NIC type: Host-Only
  IP: 192.168.146.140
NIC 3:
  Used for communication with the external/ssh fencing device
  NIC type: Bridged
  IP: 10.167.217.171


Cluster public IP
  192.168.152.200


2. Environment configuration

Starting from the previously configured 3-node environment, remove disknode from cluster management.

Change the configuration so that the action taken when quorum cannot be reached is to ignore it:
no-quorum-policy=ignore

And change the expected number of quorum votes:
expected-quorum-votes=2
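For example, both properties can be set with the crm shell (assuming crmsh, as used elsewhere in this article):

crm configure property no-quorum-policy=ignore
crm configure property expected-quorum-votes=2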

Delete the disknode-related configuration:
location no_iscsid rs_iscsid -inf: disknode
location votenode ClusterIP -inf: disknode
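These two constraints can be removed via crm configure edit, or with commands along these lines (assumed crmsh syntax):

crm configure delete no_iscsid
crm configure delete votenode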

[root@hanode1 ~]# crm configure edit
node disknode
node hanode1
node hanode2
primitive ClusterIP IPaddr2 \
        params ip=192.168.152.200 cidr_netmask=32 \
        op monitor interval=30s
primitive DataFS Filesystem \
        params device="/dev/sdc" directory="/mnt/pg" fstype=ext4 \
        op monitor interval=15s
primitive pg93 pgsql \
        meta target-role=Started is-managed=true migration-threshold=INFINITY failure-timeout=60s \
        op monitor interval=15s
primitive rs_iscsid lsb:iscsid \
        op monitor interval=30s \
        meta target-role=Started
primitive st-ssh stonith:external/ssh \
        params hostlist="hanode1 hanode2"
group PgGroup ClusterIP rs_iscsid DataFS pg93
clone st-sshclone st-ssh
property cib-bootstrap-options: \
        dc-version=1.1.9-2a917dd \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=true \
        no-quorum-policy=ignore \
        last-lrm-refresh=1409756808
#vim:set syntax=pcmk


Stop the corosync service on disknode:
[root@disknode ~]# /etc/init.d/corosync stop 
Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
Waiting for corosync services to unload:.                  [  OK  ]
[root@disknode ~]# chkconfig corosync off


On a terminal of one of the remaining nodes, hanode1, remove disknode from the CIB:
[root@hanode1 ~]# crm_node -R disknode --force

The result is strange: it actually deleted hanode2 instead. Trying a few more times, there were even cases where hanode2 got rebooted.
[root@hanode1 ~]# crm status
Last updated: Fri Sep  5 23:48:50 2014
Last change: Fri Sep  5 23:47:50 2014 via crm_node on hanode1
Stack: classic openais (with plugin)
Current DC: hanode1 - partition with quorum
Version: 1.1.9-2a917dd
2 Nodes configured, 2 expected votes
6 Resources configured.




Node disknode: UNCLEAN (offline)
Online: [ hanode1 ]


 Resource Group: PgGroup
     ClusterIP (ocf::heartbeat:IPaddr2): Started hanode1 
     rs_iscsid (lsb:iscsid): Started hanode1 
     DataFS (ocf::heartbeat:Filesystem): Started hanode1 
     pg93 (ocf::heartbeat:pgsql): Started hanode1 
 Clone Set: st-sshclone [st-ssh]
     Started: [ hanode1 ]
     Stopped: [ st-ssh:1 ]


/var/log/messages contains error messages like these:
Sep  5 21:27:53 hanode1 corosync[3827]:   [pcmk  ] info: pcmk_remove_member: Sent: remove-peer:disknode
Sep  5 21:27:53 hanode1 corosync[3827]:   [pcmk  ] ERROR: ais_get_int: Characters left over after parsing 'disknode': 'disknode'


Run the command once more, and this time disknode is removed:
[root@hanode1 ~]# crm_node -R disknode --force
[root@hanode1 ~]# crm status
Last updated: Fri Sep  5 23:50:16 2014
Last change: Fri Sep  5 23:50:14 2014 via crm_node on hanode1
Stack: classic openais (with plugin)
Current DC: hanode1 - partition with quorum
Version: 1.1.9-2a917dd
1 Nodes configured, 2 expected votes
5 Resources configured.




Online: [ hanode1 ]


 Resource Group: PgGroup
     ClusterIP (ocf::heartbeat:IPaddr2): Started hanode1 
     rs_iscsid (lsb:iscsid): Started hanode1 
     DataFS (ocf::heartbeat:Filesystem): Started hanode1 
     pg93 (ocf::heartbeat:pgsql): Started hanode1 
 Clone Set: st-sshclone [st-ssh]
     Started: [ hanode1 ]


After also rebooting the hanode2 server, the status finally looks normal.
[root@hanode1 ~]# crm status
Last updated: Sat Sep  6 00:01:16 2014
Last change: Fri Sep  5 23:57:54 2014 via crmd on hanode1
Stack: classic openais (with plugin)
Current DC: hanode1 - partition with quorum
Version: 1.1.9-2a917dd
2 Nodes configured, 2 expected votes
6 Resources configured.




Online: [ hanode1 hanode2 ]


 Resource Group: PgGroup
     ClusterIP (ocf::heartbeat:IPaddr2): Started hanode1 
     rs_iscsid (lsb:iscsid): Started hanode1 
     DataFS (ocf::heartbeat:Filesystem): Started hanode1 
     pg93 (ocf::heartbeat:pgsql): Started hanode1 
 Clone Set: st-sshclone [st-ssh]
     Started: [ hanode1 hanode2 ]

*) Originally I only wanted to restart the corosync service on hanode2, but it would not stop, so I killed the corosync process. Just as I was about to start corosync again, I found hanode2 had been fenced.


3. Failover tests

3.1 Scenario 1: the corosync process on the active server goes down

Kill the corosync process on the active server:
[root@hanode1 ~]# ps -ef|grep corosync
root      1355     1  0 00:00 ?        00:00:02 corosync
root      4606  2103  0 00:18 pts/0    00:00:00 grep corosync
[root@hanode1 ~]# kill -9 1355


hanode1 is rebooted almost immediately, after which hanode2 takes over the services.
[root@hanode2 ~]# crm status
Last updated: Sat Sep  6 00:22:45 2014
Last change: Fri Sep  5 23:57:54 2014 via crmd on hanode1
Stack: classic openais (with plugin)
Current DC: hanode2 - partition with quorum
Version: 1.1.9-2a917dd
2 Nodes configured, 2 expected votes
6 Resources configured.




Online: [ hanode1 hanode2 ]


 Resource Group: PgGroup
     ClusterIP (ocf::heartbeat:IPaddr2): Started hanode2 
     rs_iscsid (lsb:iscsid): Started hanode2 
     DataFS (ocf::heartbeat:Filesystem): Started hanode2 
     pg93 (ocf::heartbeat:pgsql): Started hanode2 
 Clone Set: st-sshclone [st-ssh]
     Started: [ hanode1 hanode2 ]


The log on hanode2 also shows that hanode2 takes over the resources only after fencing succeeds, which guarantees the resources are never held by two nodes at the same time.
[root@hanode2 ~]# vi /var/log/messages
Sep  6 00:15:08 hanode2 stonith-ng[1362]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for hanode1: 3daae4d8-f3f5-47bb-ac32-1a7106099eca (0)
Sep  6 00:15:13 hanode2 stonith-ng[1362]:   notice: log_operation: Operation 'reboot' [2149] (call 0 from crmd.1366) for host 'hanode1' with device 'st-ssh' returned: 0 (OK)
Sep  6 00:15:13 hanode2 stonith-ng[1362]:   notice: remote_op_done: Operation reboot of hanode1 by hanode2 for crmd.1366@hanode2.3daae4d8: OK
...
Sep  6 00:15:14 hanode2 pengine[1365]:   notice: LogActions: Start   rs_iscsid#011(hanode2)
Sep  6 00:15:14 hanode2 pengine[1365]:   notice: LogActions: Start   DataFS#011(hanode2)
Sep  6 00:15:14 hanode2 pengine[1365]:   notice: LogActions: Start   pg93#011(hanode2)


3.2 Scenario 2: the heartbeat network between active and standby is cut

Disable the heartbeat NIC of the active server hanode2 in VMware. Now the show begins: hanode1 and hanode2 kill each other at the same time.
After hanode1 and hanode2 boot up again, hanode1 is a bit faster and hanode2 gets killed; once hanode2 comes back up, it kills hanode1 again...
It seems that in this scenario the two nodes just keep cutting each other down, and neither can become the winner.


4. Improving split-brain handling

In scenario 2 above the heartbeat network is in effect a single point of failure; redundancy can be introduced at the network-device level to make the heartbeat network more stable. Is there no other way besides that? Two methods come to mind:


4.1 Method 1

Connect the heartbeat links of hanode1 and hanode2 to the same router and use the router's IP address as a ping node, letting the ping node act as the arbitrator.
Let's see whether this method works. The router's IP is 192.168.152.2.


[root@hanode1 ~]# crm configure edit


node hanode1
node hanode2
primitive ClusterIP IPaddr2 \
        params ip=192.168.152.200 cidr_netmask=32 \
        op monitor interval=30s
primitive DataFS Filesystem \
        params device="/dev/sdc" directory="/mnt/pg" fstype=ext4 \
        op monitor interval=15s
primitive pg93 pgsql \
        meta target-role=Started is-managed=true migration-threshold=INFINITY failure-timeout=60s \
        op monitor interval=15s
primitive pingCheck ocf:pacemaker:ping \
        params name=default_ping_set host_list=192.168.152.2 multiplier=100 \
        op start timeout=60s interval=0s on-fail=restart \
        op monitor timeout=60s interval=10s on-fail=restart \
        op stop timeout=60s interval=0s on-fail=ignore
primitive rs_iscsid lsb:iscsid \
        op monitor interval=30s \
        meta target-role=Started
primitive st-ssh stonith:external/ssh \
        params hostlist="hanode1 hanode2"
group PgGroup ClusterIP rs_iscsid DataFS pg93
clone clnPingCheck pingCheck
clone st-sshclone st-ssh
location rsc_location PgGroup \
        rule $id="rsc_location-rule" -inf: not_defined default_ping_set or default_ping_set lt 100
order rsc_orderi 0: clnPingCheck PgGroup
property cib-bootstrap-options: \
        dc-version=1.1.9-2a917dd \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=true \
        no-quorum-policy=ignore \
        last-lrm-refresh=1409756808
#vim:set syntax=pcmk


After the change, restart the corosync service on both nodes (for example as shown below); after a while the status is updated.
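On each of hanode1 and hanode2 (using the same init script as earlier; assuming the standard restart action is available):

/etc/init.d/corosync restart

Then check the status: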
[root@hanode1 ~]# crm_mon -Afr1
Last updated: Sat Sep  6 01:36:57 2014
Last change: Sat Sep  6 01:36:08 2014 via cibadmin on hanode1
Stack: classic openais (with plugin)
Current DC: hanode1 - partition with quorum
Version: 1.1.9-2a917dd
2 Nodes configured, 2 expected votes
8 Resources configured.




Online: [ hanode1 hanode2 ]


Full list of resources:


 Resource Group: PgGroup
     ClusterIP (ocf::heartbeat:IPaddr2): Started hanode1 
     rs_iscsid (lsb:iscsid): Started hanode1 
     DataFS (ocf::heartbeat:Filesystem): Started hanode1 
     pg93 (ocf::heartbeat:pgsql): Started hanode1 
 Clone Set: st-sshclone [st-ssh]
     Started: [ hanode1 hanode2 ]
 Clone Set: clnPingCheck [pingCheck]
     Started: [ hanode1 hanode2 ]


Node Attributes:
* Node hanode1:
    + default_ping_set                 : 100       
* Node hanode2:
    + default_ping_set                 : 100       


Migration summary:
* Node hanode2: 
* Node hanode1: 


Now disable the heartbeat NIC on the active server. The result is the same as before: the two machines still kill each other. The reason is that the fencing mechanism acts before pingCheck does; pingCheck can only influence resource placement. So this method does not work.

4.2 Method 2

Change the stonith-action from reboot to off, so that whichever side acts faster has a chance of becoming the winner.
[root@hanode1 ~]# crm_attribute --attr-name stonith-action  --attr-value off
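Equivalently, assuming crmsh is installed (as elsewhere in this article), the property can also be set and then checked through the crm shell:

crm configure property stonith-action=off
crm configure show | grep stonith-action    # verify the new value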


The external/ssh stonith device used here for testing does not support poweroff, so the external/ssh script was modified for the test; the final modified script is given in the appendix.
Then remove the pingCheck added earlier and try again, disabling the heartbeat NIC of the active server. Unfortunately, both machines powered off.
Looking at the external/ssh script, it sleeps for 2 seconds before powering off; remove the sleep and try again. This time, at last, one node survived.


POWEROFF_COMMAND="echo 'sleep 2; /sbin/poweroff -nf' | SHELL=/bin/sh at now >/dev/null 2>&1"
=>
POWEROFF_COMMAND="echo '/sbin/poweroff -nf' | SHELL=/bin/sh at now >/dev/null 2>&1"


5. Conclusion

With two nodes, as long as a fencing device is present, the shared resources are protected from corruption. Without a fencing device, a preemption resource must be configured. When the heartbeat network fails there is no way to guarantee that the two-node cluster stays available, but by setting stonith-action to off there is a high probability that one machine is still alive after a split-brain.


6. Appendix

The modified external/ssh script:
[root@hanode1 ~]# cat /usr/lib64/stonith/plugins/external/ssh


#!/bin/sh
#
# External STONITH module for ssh.
#
# Copyright (c) 2004 SUSE LINUX AG - Lars Marowsky-Bree
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like. Any license provided herein, whether implied or
# otherwise, applies only to this software file. Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
SSH_COMMAND="/usr/bin/ssh -q -x -o PasswordAuthentication=no -o StrictHostKeyChecking=no -n -l root"
#SSH_COMMAND="/usr/bin/ssh -q -x -n -l root"
REBOOT_COMMAND="echo 'sleep 2; /sbin/reboot -nf' | SHELL=/bin/sh at now >/dev/null 2>&1"
# Warning: If you select this poweroff command, it'll physically
# power-off the machine, and quite a number of systems won't be remotely
# revivable.
# TODO: Probably should touch a file on the server instead to just
# prevent heartbeat et al from being started after the reboot.
POWEROFF_COMMAND="echo '/sbin/poweroff -nf' | SHELL=/bin/sh at now >/dev/null 2>&1"
#POWEROFF_COMMAND="echo 'sleep 2; /sbin/reboot -nf' | SHELL=/bin/sh at now >/dev/null 2>&1"

# Rewrite the hostlist to accept "," as a delimeter for hostnames too.
hostlist=`echo $hostlist | tr ',' ' '`

is_host_up() {
    for j in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    do
        if
            ping -w1 -c1 "$1" >/dev/null 2>&1
        then
            sleep 1
        else
            return 1
        fi
    done
    return 0
}

echo hostlist="$hostlist" para="$*" >>/var/stonith_ssh.log

case $1 in
gethosts)
    for h in $hostlist ; do
        echo $h
    done
    exit 0
    ;;
on)
    # Can't really be implemented because ssh cannot power on a system
    # when it is powered off.
    exit 1
    ;;
off)
    # Shouldn't really be implemented because if ssh cannot power on a
    # system, it shouldn't be allowed to power it off.
    # exit 1
    # ;;
    h_target=`echo $2 | tr A-Z a-z`
    for h in $hostlist
    do
        h=`echo $h | tr A-Z a-z`
        [ "$h" != "$h_target" ] &&
            continue
        if
            case ${livedangerously} in
            [Yy]*) is_host_up $h;;
            *) true;;
            esac
        then
            $SSH_COMMAND "$2" "$POWEROFF_COMMAND"
            # Good thing this is only for testing...
            if
                is_host_up $h
            then
                exit 1
            else
                exit 0
            fi
        else
            # well... Let's call it successful, after all this is only for testing...
            exit 0
        fi
    done
    exit 1
    ;;
reset)
    h_target=`echo $2 | tr A-Z a-z`
    for h in $hostlist
    do
        h=`echo $h | tr A-Z a-z`
        [ "$h" != "$h_target" ] &&
            continue
        if
            case ${livedangerously} in
            [Yy]*) is_host_up $h;;
            *) true;;
            esac
        then
            $SSH_COMMAND "$2" "$REBOOT_COMMAND"
            # Good thing this is only for testing...
            if
                is_host_up $h
            then
                exit 1
            else
                exit 0
            fi
        else
            # well... Let's call it successful, after all this is only for testing...
            exit 0
        fi
    done
    exit 1
    ;;
status)
    if
        [ -z "$hostlist" ]
    then
        exit 1
    fi
    for h in $hostlist
    do
        if
            ping -w1 -c1 "$h" 2>&1 | grep "unknown host"
        then
            exit 1
        fi
    done
    exit 0
    ;;
getconfignames)
    echo "hostlist"
    exit 0
    ;;
getinfo-devid)
    echo "ssh STONITH device"
    exit 0
    ;;
getinfo-devname)
    echo "ssh STONITH external device"
    exit 0
    ;;
getinfo-devdescr)
    echo "ssh-based host reset"
    echo "Fine for testing, but not suitable for production!"
    echo "Only reboot action supported, no poweroff, and, surprisingly enough, no poweron."
    exit 0
    ;;
getinfo-devurl)
    echo "http://openssh.org"
    exit 0
    ;;
getinfo-xml)
    cat << SSHXML
<parameters>
<parameter name="hostlist" unique="1" required="1">
<content type="string" />
<shortdesc lang="en">
Hostlist
</shortdesc>
<longdesc lang="en">
The list of hosts that the STONITH device controls
</longdesc>
</parameter>
<parameter name="livedangerously" unique="0" required="0">
<content type="enum" />
<shortdesc lang="en">
Live Dangerously!!
</shortdesc>
<longdesc lang="en">
Set to "yes" if you want to risk your system's integrity.
Of course, since this plugin isn't for production, using it
in production at all is a bad idea. On the other hand,
setting this parameter to yes makes it an even worse idea.
Viva la Vida Loca!
</longdesc>
</parameter>
</parameters>
SSHXML
    exit 0
    ;;
*)
    exit 1
    ;;
esac

