Automating ECS System Event Handling with OpenAPI

2018-04-11

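What are system events

After you deploy your business systems on Alibaba Cloud ECS, Alibaba Cloud guarantees the high availability of the ECS compute service. In rare cases, for example when a fault is detected in the hardware hosting an ECS instance, a planned maintenance event is generated and you are notified.

To learn more about system events, see "Instance System Events" and "Making Operations More Efficient: About ECS System Events".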

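How to monitor and respond to system events

To keep your business running smoothly, you need to monitor ECS system events and respond to them promptly and appropriately.

To handle ECS proactive maintenance events from the console, see "ECS Proactive Maintenance Events: Stay in Control" (ECS主动运维事件--让你HOLD住全场).

Compared with logging in to the ECS console and handling system events by hand after a notification arrives, monitoring and handling them programmatically improves operational efficiency, reduces the chance of omissions and mistakes, and spares your operations staff from worrying about fault notifications in the middle of the night. The more ECS instances you run, the greater the benefit of automation.

ECS provides two OpenAPI operations for monitoring instance health status and system events.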

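1. DescribeInstancesFullStatus: query the full status of instances

The full status of an ECS instance includes:

The lifecycle status of the instance, for example whether it is Running or Stopped
The health status of the instance, for example whether it is Ok or Warning
All system events in the Scheduled (pending execution) state

This OpenAPI operation focuses on the current state of an instance and does not return historical events that have already finished. For proactive operations, only events in the Scheduled state matter. A Scheduled event is still within the user action window: before the event's planned execution time, NotBefore, a program can act so that the event never has to be executed.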
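As a rough illustration of the three pieces of information listed above, the sketch below pulls them out of a single entry of the response. The field names `Status`, `HealthStatus`, `EventId`, and `EventType` are assumptions about the response layout (only `InstanceId`, `ScheduledSystemEventSet`, `ScheduledSystemEventType`, and `NotBefore` appear in this article's sample code), and `sample_entry` is made-up data, not real API output.

```python
def summarize_full_status(entry):
    """Return (instance_id, lifecycle, health, scheduled_events) for one entry."""
    # Each status is assumed to be a small object with a 'Name' field.
    lifecycle = (entry.get('Status') or {}).get('Name')
    health = (entry.get('HealthStatus') or {}).get('Name')
    event_set = entry.get('ScheduledSystemEventSet') or {}
    events = event_set.get('ScheduledSystemEventType') or []
    scheduled = [
        (e.get('EventId'), (e.get('EventType') or {}).get('Name'), e.get('NotBefore'))
        for e in events
    ]
    return entry.get('InstanceId'), lifecycle, health, scheduled


# Hypothetical entry for illustration only; the IDs and codes are invented.
sample_entry = {
    'InstanceId': 'i-example',
    'Status': {'Code': 1, 'Name': 'Running'},
    'HealthStatus': {'Code': 64, 'Name': 'Warning'},
    'ScheduledSystemEventSet': {
        'ScheduledSystemEventType': [{
            'EventId': 'e-example',
            'EventType': {'Code': 1, 'Name': 'SystemMaintenance.Reboot'},
            'NotBefore': '2018-04-20T02:00:00Z',
        }]
    },
}

print(summarize_full_status(sample_entry))
```

The `or {}` and `or []` fallbacks keep the helper from failing when a field is missing or null, which is worth doing for any response layout you have not verified against the API reference.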

First, call the DescribeInstancesFullStatus OpenAPI operation to check whether any SystemMaintenance.Reboot event is pending execution.

def build_instance_full_status_request():
    # Only instances that have a SystemMaintenance.Reboot event are returned
    request = DescribeInstancesFullStatusRequest()
    request.set_EventType('SystemMaintenance.Reboot')
    return request


# Send an OpenAPI request; returns the parsed JSON, or None if the call failed
def _send_request(request):
    request.set_accept_format('json')
    try:
        response_str = client.do_action_with_exception(request)
        logging.info(response_str)
        return json.loads(response_str)
    except Exception as e:
        logging.error(e)
        return None


# only_check=True: only report whether a SystemMaintenance.Reboot event exists.
# only_check=False: also handle every SystemMaintenance.Reboot event found.
# Returns True/False, or None if the query itself failed.
def check_scheduled_reboot_events(only_check=False, instance_id=None):
    request = build_instance_full_status_request()
    if instance_id:
        request.set_InstanceIds([instance_id])
    response = _send_request(request)
    if response is None or response.get('Code') is not None:
        logging.error('DescribeInstancesFullStatus failed: %s', response)
        return None
    instance_full_status_list = response.get('InstanceFullStatusSet').get('InstanceFullStatusType')
    # The query filters by event type, so instances without a
    # SystemMaintenance.Reboot system event are not returned
    exist_reboot_event = len(instance_full_status_list) > 0
    if not exist_reboot_event:
        print("No scheduled SystemMaintenance.Reboot event found")
    if only_check:
        return exist_reboot_event
    for instance_full_status in instance_full_status_list:
        event_instance_id = instance_full_status.get('InstanceId')
        scheduled_reboot_events = instance_full_status.get('ScheduledSystemEventSet').get(
            'ScheduledSystemEventType')
        for scheduled_reboot_event in scheduled_reboot_events:
            handle_reboot_event(event_instance_id, scheduled_reboot_event)
    return exist_reboot_event

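Tip: Proactive maintenance system events leave a sufficiently long user action window, usually measured in days, so there is no need to poll for pending system events frequently. In the future we will provide a message-queue-based interface for consuming system events.

If a SystemMaintenance.Reboot system event exists, decide whether to handle it yourself based on the type of workload running on the instance.

Tip: Even if you let the ECS system perform the reboot, backing up your important data beforehand is a good idea.

If rebooting the instance affects your business, you may want to choose a more suitable off-peak time before NotBefore, and set up a scheduled task that performs the reboot at that time.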


def handle_reboot_event(instance_id, reboot_event):
    not_before_str = reboot_event.get('NotBefore')
    not_before = datetime.strptime(not_before_str, '%Y-%m-%dT%H:%M:%SZ')
    print("Instance %s has a SystemMaintenance.Reboot event scheduled to execute at %s" % (instance_id, str(not_before)))
    # Based on your workload, pick the time before not_before with the least impact
    # and use a scheduled task to reboot the instance at that time

    # This example simplifies that to rebooting immediately
    pre_reboot(instance_id)
    reboot_instance(instance_id)
    post_reboot(instance_id)


def reboot_instance(instance_id):
    print("Reboot instance %s now..." % instance_id)
    reboot_request = RebootInstanceRequest()
    reboot_request.set_InstanceId(instance_id)
    _send_request(reboot_request)


def pre_reboot(instance_id):
    # Preparation before the reboot, such as taking backups
    print("Do pre-reboot works...")


def post_reboot(instance_id):
    # Follow-up work after the reboot, such as health checks
    # Check whether the reboot succeeded
    print("Do post-reboot works...")

    # Normally the SystemMaintenance.Reboot event changes to the Avoided state
    # a few seconds after a successful reboot.
    # Query DescribeInstancesFullStatus again to confirm the event is no longer returned
    wait_event_disappear(instance_id)
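The handler above reboots immediately. One possible way to choose "a more suitable off-peak time before NotBefore" is sketched below: take the latest daily quiet hour that still leaves a safety margin before NotBefore. The function `pick_reboot_time`, the `quiet_hour` default of 03:00, and the one-hour margin are all illustrative assumptions, not part of the ECS API. The times are treated as UTC, matching the trailing `Z` in the NotBefore format parsed above.

```python
from datetime import datetime, timedelta


def pick_reboot_time(not_before, now, quiet_hour=3, safety_margin=timedelta(hours=1)):
    """Return the latest quiet_hour:00 that lies after `now` and at least
    `safety_margin` before `not_before`, or None if no such time exists."""
    deadline = not_before - safety_margin
    candidate = deadline.replace(hour=quiet_hour, minute=0, second=0, microsecond=0)
    if candidate > deadline:
        # The quiet hour on the deadline's date is too late; use the previous day.
        candidate -= timedelta(days=1)
    return candidate if candidate > now else None


# Hypothetical values for illustration only.
not_before = datetime.strptime('2018-04-20T02:00:00Z', '%Y-%m-%dT%H:%M:%SZ')
now = datetime(2018, 4, 17, 10, 0, 0)
print(pick_reboot_time(not_before, now))
```

When the function returns None, no quiet hour remains inside the window, so the caller has to fall back to another policy, such as rebooting right away or letting ECS perform the reboot at NotBefore.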
        
          
        
        
        
          

After the reboot completes successfully, the system event changes to the Avoided state within a short time.

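def wait_event_disappear(instance_id):
    wait_sec = 0
    while wait_sec < TIME_OUT:
        exist = check_scheduled_reboot_events(only_check=True, instance_id=instance_id)
        # Only an explicit False means the event is gone; a failed query
        # yields None, in which case we keep polling
        if exist is False:
            print("SystemMaintenance.Reboot system event is avoided")
            return
        time.sleep(10)
        wait_sec += 10
    logging.warning("SystemMaintenance.Reboot event of %s is still present after %s seconds",
                    instance_id, TIME_OUT)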

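Your automation program needs to handle all kinds of exceptions properly so that the scheduled reboot is timely and reliable. In particular, do not handle an event again before its state has changed, to avoid unnecessary reboots.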
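One simple way to avoid handling the same event twice is to remember the IDs of events that have already been handled. The sketch below keeps them in a local JSON file. `HandledEventStore` is a hypothetical helper, and it assumes each scheduled event carries a unique identifier (an `EventId` field is assumed here; this article's sample code does not read it).

```python
import json
import os
import tempfile


class HandledEventStore(object):
    """Remembers which event IDs have been handled, persisted as a JSON file."""

    def __init__(self, path):
        self.path = path
        self._ids = set()
        if os.path.exists(path):
            with open(path) as f:
                self._ids = set(json.load(f))

    def claim(self, event_id):
        """Record event_id and return True, or return False if already recorded."""
        if event_id in self._ids:
            return False
        self._ids.add(event_id)
        with open(self.path, 'w') as f:
            json.dump(sorted(self._ids), f)
        return True


if __name__ == '__main__':
    # Demo with made-up event IDs and a throwaway file location.
    path = os.path.join(tempfile.mkdtemp(), 'handled_events.json')
    store = HandledEventStore(path)
    for event_id in ['e-example-1', 'e-example-1', 'e-example-2']:
        if store.claim(event_id):
            print('handling %s' % event_id)
        else:
            print('skipping %s, already handled' % event_id)
```

The ID is recorded before the reboot is attempted, so a crash midway leaves the event marked as handled and it will not be retried automatically. That trade-off favors avoiding duplicate reboots over guaranteed handling. The file is also not safe for several processes writing at once.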

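2. DescribeInstanceHistoryEvents: query the historical events of instances

This operation queries the system events of a specified ECS instance. By default it returns only historical events that are no longer active. If you specify all event states, it returns every event, including active ones.

Because it returns only historical events by default, this API is meant for analyzing and reviewing an instance's past events and tracing the root cause of problems. Some event types, such as SystemFailure.Reboot, do not necessarily leave a user action window: for example, after an unexpected emergency failure, Alibaba Cloud recovers and reboots your instance immediately. Such events can be found among the historical events.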
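For a review of past events, a first step might be to count them by event type and final state. The sketch below works on an already parsed response. The layout it reads (`InstanceSystemEventSet`, `InstanceSystemEventType`, and `EventType` / `EventCycleStatus` objects with a `Name` field) is an assumption that should be checked against the API reference, and `sample_response` is invented data.

```python
from collections import Counter


def count_events_by_type(response):
    """Count events per (event type, cycle status) in a history-events response."""
    event_set = response.get('InstanceSystemEventSet') or {}
    events = event_set.get('InstanceSystemEventType') or []
    counts = Counter()
    for event in events:
        event_type = (event.get('EventType') or {}).get('Name')
        status = (event.get('EventCycleStatus') or {}).get('Name')
        counts[(event_type, status)] += 1
    return counts


# Hypothetical response for illustration only; not real API output.
sample_response = {
    'InstanceSystemEventSet': {
        'InstanceSystemEventType': [
            {'EventType': {'Name': 'SystemMaintenance.Reboot'},
             'EventCycleStatus': {'Name': 'Avoided'}},
            {'EventType': {'Name': 'SystemMaintenance.Reboot'},
             'EventCycleStatus': {'Name': 'Avoided'}},
            {'EventType': {'Name': 'SystemFailure.Reboot'},
             'EventCycleStatus': {'Name': 'Executed'}},
        ]
    }
}

for (event_type, status), n in sorted(count_events_by_type(sample_response).items()):
    print('%s / %s: %d' % (event_type, status, n))
```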

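Summary

Use DescribeInstancesFullStatus to query instance status and system events in the Scheduled state.
Use DescribeInstanceHistoryEvents to review historical events. If you specify the system event states, it can also return unfinished system events (in the Scheduled and Executing states).
Use an automation program to handle system events in the Scheduled state.
If you only need to query system events, DescribeInstanceHistoryEvents is recommended because it performs better.

In the future we will release more types of system events for ECS instances and storage, covering more operations scenarios. Stay tuned!

The complete sample code is as follows: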

# coding=utf-8

# If the Python SDK is not installed, run 'sudo pip install aliyun-python-sdk-ecs'
# If it is already installed, upgrade it with 'sudo pip install --upgrade aliyun-python-sdk-ecs'
# Make sure the SDK version is 4.4.3; check with 'pip show aliyun-python-sdk-ecs'

import json
import logging
from datetime import datetime
import time

from aliyunsdkcore.client import AcsClient
from aliyunsdkecs.request.v20140526.DescribeInstancesFullStatusRequest import DescribeInstancesFullStatusRequest
from aliyunsdkecs.request.v20140526.RebootInstanceRequest import RebootInstanceRequest

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S')

# your access key ID
ak_id = "YOUR_ACCESS_KEY_ID"
# your access key secret
ak_secret = "YOUR_ACCESS_SECRET"
region_id = "cn-shanghai"
# maximum time, in seconds, to wait for the event to disappear after a reboot
TIME_OUT = 5 * 60

client = AcsClient(ak_id, ak_secret, region_id)


def build_instance_full_status_request():
    # Only instances that have a SystemMaintenance.Reboot event are returned
    request = DescribeInstancesFullStatusRequest()
    request.set_EventType('SystemMaintenance.Reboot')
    return request


# Send an OpenAPI request; returns the parsed JSON, or None if the call failed
def _send_request(request):
    request.set_accept_format('json')
    try:
        response_str = client.do_action_with_exception(request)
        logging.info(response_str)
        return json.loads(response_str)
    except Exception as e:
        logging.error(e)
        return None


# only_check=True: only report whether a SystemMaintenance.Reboot event exists.
# only_check=False: also handle every SystemMaintenance.Reboot event found.
# Returns True/False, or None if the query itself failed.
def check_scheduled_reboot_events(only_check=False, instance_id=None):
    request = build_instance_full_status_request()
    if instance_id:
        request.set_InstanceIds([instance_id])
    response = _send_request(request)
    if response is None or response.get('Code') is not None:
        logging.error('DescribeInstancesFullStatus failed: %s', response)
        return None
    instance_full_status_list = response.get('InstanceFullStatusSet').get('InstanceFullStatusType')
    # The query filters by event type, so instances without a
    # SystemMaintenance.Reboot system event are not returned
    exist_reboot_event = len(instance_full_status_list) > 0
    if not exist_reboot_event:
        print("No scheduled SystemMaintenance.Reboot event found")
    if only_check:
        return exist_reboot_event
    for instance_full_status in instance_full_status_list:
        event_instance_id = instance_full_status.get('InstanceId')
        scheduled_reboot_events = instance_full_status.get('ScheduledSystemEventSet').get(
            'ScheduledSystemEventType')
        for scheduled_reboot_event in scheduled_reboot_events:
            handle_reboot_event(event_instance_id, scheduled_reboot_event)
    return exist_reboot_event


def handle_reboot_event(instance_id, reboot_event):
    not_before_str = reboot_event.get('NotBefore')
    not_before = datetime.strptime(not_before_str, '%Y-%m-%dT%H:%M:%SZ')
    print("Instance %s has a SystemMaintenance.Reboot event scheduled to execute at %s" % (instance_id, str(not_before)))
    # Based on your workload, pick the time before not_before with the least impact
    # and use a scheduled task to reboot the instance at that time

    # This example simplifies that to rebooting immediately
    pre_reboot(instance_id)
    reboot_instance(instance_id)
    post_reboot(instance_id)


def reboot_instance(instance_id):
    print("Reboot instance %s now..." % instance_id)
    reboot_request = RebootInstanceRequest()
    reboot_request.set_InstanceId(instance_id)
    _send_request(reboot_request)


def pre_reboot(instance_id):
    # Preparation before the reboot, such as taking backups
    print("Do pre-reboot works...")


def post_reboot(instance_id):
    # Follow-up work after the reboot, such as health checks
    # Check whether the reboot succeeded
    print("Do post-reboot works...")

    # Normally the SystemMaintenance.Reboot event changes to the Avoided state
    # a few seconds after a successful reboot.
    # Query DescribeInstancesFullStatus again to confirm the event is no longer returned
    wait_event_disappear(instance_id)


def wait_event_disappear(instance_id):
    wait_sec = 0
    while wait_sec < TIME_OUT:
        exist = check_scheduled_reboot_events(only_check=True, instance_id=instance_id)
        # Only an explicit False means the event is gone; a failed query
        # yields None, in which case we keep polling
        if exist is False:
            print("SystemMaintenance.Reboot system event is avoided")
            return
        time.sleep(10)
        wait_sec += 10
    logging.warning("SystemMaintenance.Reboot event of %s is still present after %s seconds",
                    instance_id, TIME_OUT)


if __name__ == '__main__':
    check_scheduled_reboot_events(only_check=False)