How Does Alibaba Cloud Power the Biggest Online Shopping Festival?

Summary: Have you ever wondered what the underlying technology behind Alibaba Single's Day Shopping Festival (also known as 11-11) is like?

Author: Alibaba Group Senior Staff Engineer Ding Yu

Have you ever wondered what the underlying technology behind Alibaba Single's Day Shopping Festival (also known as 11-11) is like? With sales reaching over US$17.8 billion in 2016, Single's Day has become the largest online shopping day in the world!

Alibaba Cloud's infrastructure has evolved rapidly to cope with increasing demands from the entire Alibaba ecosystem, especially for Single's Day. From 2009 to 2016, peak transaction volume increased more than 400-fold!

Figure 1: Peak transaction volume on Single's Day from 2009 to 2016

Such a feat can only be achieved with a robust computing architecture, one capable not only of handling bursty traffic but also of recovering quickly from system faults. While sales revenue typically grows linearly with transaction volume, system complexity grows exponentially at such a large scale. What's more, deploying and maintaining such a complex system is labor-intensive and costly.

Designing a High Availability Infrastructure

As the Architect for Single's Day since 2009, I will share with you some of our key strategies in designing our infrastructure.

Although cloud computing has freed us from the geographical constraints of data centers, supporting an event such as Single's Day isn't as straightforward as simply adding more servers. We need to know precisely how much computing power we need to ensure high availability and reliability while keeping costs at a minimum.

Alibaba Cloud tackles this problem from multiple angles:
1. Comprehensive load testing on system architecture
2. System architecture fault simulation
3. Cross-region server deployment
4. Automated intelligent control

We will cover these four topics in further detail in the following sections.

Figure 2: Enterprise high availability design

Comprehensive Load Testing on System Architecture

Load testing is one of the default methods for performance testing in most systems. Basically, we simulate the traffic load of Single's Day and run it against our existing infrastructure. We use traffic data collected from previous years, along with predicted data to account for this year's growth. An important purpose of load testing is not only to discover the maximum capacity but also to determine the applications and services that customers use most during this period.
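
As a rough illustration of how historical traffic data and load-test results might feed a capacity plan, the sketch below projects a target peak from previous years and converts it into a server count. The growth model, the safety margin, the traffic figures, and the function names are all hypothetical assumptions made for this example; they are not Alibaba's actual data or method.

```python
import math

def project_peak(previous_peaks, safety_margin=0.2):
    """Project the next peak from the most recent year-over-year growth ratio.

    previous_peaks: yearly peak loads, oldest first (hypothetical units: requests/s)
    safety_margin: extra headroom added on top of the projection (assumed value)
    """
    if len(previous_peaks) < 2:
        raise ValueError("need at least two years of data")
    growth = previous_peaks[-1] / previous_peaks[-2]
    return previous_peaks[-1] * growth * (1 + safety_margin)

def servers_needed(target_load, per_server_capacity):
    """Servers required to carry target_load, rounded up to a whole server."""
    return math.ceil(target_load / per_server_capacity)

# Made-up numbers, for illustration only
peaks = [50_000, 100_000]
target = project_peak(peaks)
# per_server_capacity would come from load testing a single server
print(servers_needed(target, per_server_capacity=800))
```

In practice the per-server capacity is the number that load testing measures, which is why the projection alone is not enough to size the system.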

System Architecture Fault Simulation

Essentially, fault simulation is a form of stress testing on our system architecture. We intentionally disable certain services and overload the system with heavy loads. In particular, we look for any single points of failure (SPOFs) in our architecture and eliminate them.
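
The idea of disabling services to expose single points of failure can be sketched in a few lines. This is a toy model under stated assumptions: the service names, replica counts, and the `find_spofs` helper are hypothetical, and a real fault simulation would disable live services rather than edit a dictionary.

```python
def find_spofs(replicas, critical_path):
    """Return services whose loss of one instance breaks the critical path.

    replicas: dict mapping service name -> number of running instances
    critical_path: services that must all be available for a request to succeed
    """
    spofs = []
    for service in critical_path:
        # Simulate the fault: take one instance of this service offline
        degraded = dict(replicas)
        degraded[service] -= 1
        # The path survives only if every service still has at least one instance
        if any(degraded[s] < 1 for s in critical_path):
            spofs.append(service)
    return spofs

# Hypothetical service topology, for illustration only
replicas = {"gateway": 3, "cart": 2, "payment": 1, "inventory": 2}
print(find_spofs(replicas, ["gateway", "cart", "payment", "inventory"]))
# prints ['payment']
```

Here "payment" runs as a single instance, so disabling it breaks the whole path; adding a second instance would remove it from the result.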

Cross-Region Server Deployment

In most scenarios, servers run within a single region only. However, this approach may not be sufficient when faced with the extreme loads of Single's Day. Therefore, we use cross-region deployment to expand capacity and improve service availability. We route users to different servers based on user ID, and employ an active-active configuration in our clusters to maintain high availability and achieve seamless service handover. In addition, data is backed up across multiple sites to enhance disaster recovery capabilities.

Figure 3: High availability multi-region cluster
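
A minimal sketch of routing by user ID with active-active handover might look like the following. The region names, the modulo-based mapping, and the `route` function are assumptions made for this example; the article does not describe the actual routing rule.

```python
REGIONS = ["region-a", "region-b", "region-c"]  # hypothetical region names

def route(user_id, healthy):
    """Pick a region for a user.

    The home region is derived from the user ID; if it is not healthy,
    the user is handed over to the next healthy region in the list.
    """
    home = user_id % len(REGIONS)
    for offset in range(len(REGIONS)):
        candidate = REGIONS[(home + offset) % len(REGIONS)]
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy region available")

# All regions healthy: user 1001 lands in its home region
print(route(1001, healthy={"region-a", "region-b", "region-c"}))  # prints region-c
# Home region down: the same user is served by another active region
print(route(1001, healthy={"region-a", "region-b"}))              # prints region-a
```

Because every region is active and holds a copy of the data, the handover in the second call needs no cold start, which is the property an active-active configuration is meant to provide.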

Automated Intelligent Control

Even with all of the technologies discussed previously, it is almost impossible to control traffic flow and scale resources manually in a large system. That is why we use automated intelligent control, which focuses on traffic control and fault recovery.

Because we don't have access to unlimited resources, there is always a possibility of having too much load. To handle this problem, we can prioritize users based on the type of request. For example, customers completing purchases should be prioritized over users who are only browsing the website. Once requests are prioritized, we can place them in a queue and complete them in queue order. We can also adjust the quality of service that users receive based on this queueing system.

Figure 4: User traffic control
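
The prioritized queue described above can be sketched with Python's standard `heapq` module. The request types, their priority values, the queue capacity, and the `RequestQueue` class are hypothetical choices for illustration, not the actual control system.

```python
import heapq
import itertools

# Lower number = higher priority; request types and values are hypothetical
PRIORITY = {"checkout": 0, "add_to_cart": 1, "browse": 2}

class RequestQueue:
    def __init__(self, capacity):
        self._heap = []
        self._capacity = capacity
        # Tie-breaker that keeps first-come, first-served order within a priority
        self._counter = itertools.count()

    def submit(self, request_type, user):
        """Queue a request; return False (shed it) when the queue is full."""
        if len(self._heap) >= self._capacity:
            return False
        heapq.heappush(self._heap, (PRIORITY[request_type], next(self._counter), user))
        return True

    def next_request(self):
        """Pop the highest-priority request, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, user = heapq.heappop(self._heap)
        return user

queue = RequestQueue(capacity=3)
queue.submit("browse", "alice")
queue.submit("checkout", "bob")
queue.submit("add_to_cart", "carol")
accepted = queue.submit("browse", "dave")  # queue is full, so this request is shed
order = [queue.next_request() for _ in range(3)]
print(accepted, order)  # prints False ['bob', 'carol', 'alice']
```

This sketch sheds any request that arrives when the queue is full, regardless of type. A fuller design could instead evict the lowest-priority entry to make room for a purchase, which is one way to vary quality of service by request type.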

As the number of devices increases, the probability of a device fault increases as well. When a server fails, our system detects the anomaly and reassigns its users to the next nearest server. This automatic approach significantly reduces delay, which in turn improves the user experience and minimizes O&M costs. In addition, the system triggers alarms to notify our engineers about these faults, helping our team locate and troubleshoot them quickly.

Figure 5: Server fault recovery
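
The recovery flow (detect a failed server, move its users to the nearest healthy one, raise an alarm) can be sketched as follows. The data layout, the one-dimensional "position" used as a stand-in for network distance, and the `reassign` function are all assumptions for this example.

```python
def reassign(servers, failed, alarms):
    """Move users off failed servers to the nearest healthy one and record alarms.

    servers: dict name -> {"position": number, "users": list of user names}
    failed: set of server names detected as unhealthy
    alarms: list that collects alarm messages for engineers
    """
    healthy = [name for name in servers if name not in failed]
    if not healthy:
        raise RuntimeError("no healthy server available")
    for name in sorted(failed):
        source = servers[name]
        # Nearest healthy server by (hypothetical) one-dimensional distance
        target = min(
            healthy,
            key=lambda h: abs(servers[h]["position"] - source["position"]),
        )
        servers[target]["users"].extend(source["users"])
        alarms.append(f"{name} failed; {len(source['users'])} users moved to {target}")
        source["users"] = []
    return servers

# Hypothetical cluster, for illustration only
servers = {
    "s1": {"position": 0, "users": ["u1"]},
    "s2": {"position": 10, "users": ["u2", "u3"]},
    "s3": {"position": 12, "users": []},
}
alarms = []
reassign(servers, failed={"s2"}, alarms=alarms)
print(servers["s3"]["users"], alarms)
# prints ['u2', 'u3'] ['s2 failed; 2 users moved to s3']
```

The alarm list stands in for whatever notification channel engineers actually use; the point is that reassignment and alerting happen in the same automated step.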

Conclusion

As we can see, powering an event as large as Single's Day is no easy task. With proper planning and design, we can cope with even the most unexpected challenges for this event. We are confident that our evolved architecture can achieve a lot more for this year's Single's Day festival!

However, one question springs to mind: what do we do with all this computing power when the festival ends? For most of our systems, we adopt a hybrid cloud environment. With hybrid cloud, we can scale resources up as required and still maintain a "lighter" system when the load is low (such as when the Single's Day festival ends). This way, we can minimize operating costs while maximizing our capacity.

In addition, we use Alibaba Cloud's core products as well as our family of distributed middleware. Currently, our distributed middleware offerings are limited to customers in Mainland China, but we hope to make them available to customers across the globe soon.

If you want to learn more about the underlying technology for Alibaba Single's Day, please check out my presentation video at The Computing Conference 2017.

If you are interested in building your own infrastructure with Alibaba Cloud products, you should definitely check out our attractive offers on 11-11 Cloud Deals!

Core Products (available globally):
• Elastic Compute Service (ECS)
• Server Load Balancer (SLB)
• Auto Scaling
• ApsaraDB for RDS
• CDN

Distributed Middleware (currently only available in Mainland China):
• Distributed Relational Database Service (DRDS)
• Cloud Service Bus (CSB)
• Global Transaction Service (GTS)
• Application Real-Time Monitoring Service (ARMS)
• Message Queue (MQ)
• Enterprise Distributed Application Service (EDAS)
