Sharing, Storing, and Computing Massive Amounts of Data


Background

Data is crucial to the operation of any business. Businesses often collect large numbers of logs so that they can better understand their own services and the people who are using them.

As time goes by, the number and activity of users continuously increase, as do the speed at which data is generated, the accumulated volume of data, and the number of data dimensions and data types. As a result, we end up with more and more data islands.

We have reached a point where data islands are a heavy burden on a company's IT department, increasing costs without increasing earnings.

Storing and processing large amounts of data poses serious problems for businesses, such as:

  1. Serious data island problems (in the absence of a big data cloud platform)
  2. Inaccurate cost estimation, wasting resources when resizing computing capacity
  3. Significant data redundancy
  4. High storage costs
  5. On-premises hardware that is hard to scale and expensive to maintain
  6. Data volumes too large to back up
  7. Extensive business needs that lead to diverse data types and high analysis and development costs

This article discusses some potential solutions to address issues when dealing with massive amounts of data for specific industries.

Industries

1. Logistics

Packages generate a lot of tracking data from collection to shipment, transportation, transshipment, and distribution. Each time a package reaches a checkpoint, it will be scanned and its status will be recorded.

During the transportation process, the vehicle associated with the package is tracked, including the vehicle records, route, fuel consumption, vehicle status, and driver's status. During the distribution process, both the positional information of the courier and the distribution status of the package will be tracked, which also generates a large amount of data.

A transportation vehicle may generate tens of thousands of tracking records in a day, and so may a courier. Even a small package may generate hundreds of tracking records!

One potential solution for the logistics industry is dynamic path planning. With dynamic path planning, we can obtain information on nearby couriers in real time according to their positional information. This information can then be used to compute an optimal, least-cost route for quick delivery. If you are interested, refer to Technologies used in Double 11 shopping festival - logistics and dynamic path planning (article in Chinese).
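As a sketch of the real-time "nearby couriers" lookup (the courier_position table and coordinates are hypothetical, and PostGIS is assumed to be installed), the KNN distance operator `<->` uses the GiST index to return the nearest couriers efficiently:

```sql
-- Hypothetical courier position table; assumes the PostGIS extension.
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE courier_position (
    courier_id  int PRIMARY KEY,
    updated_at  timestamptz NOT NULL,
    pos         geometry(Point, 4326) NOT NULL
);
CREATE INDEX ON courier_position USING gist (pos);

-- Ten couriers nearest to a pickup point, ordered by distance.
-- The KNN operator <-> is index-assisted, so this stays fast at scale.
SELECT courier_id, updated_at
FROM courier_position
ORDER BY pos <-> ST_SetSRID(ST_MakePoint(120.02, 30.28), 4326)
LIMIT 10;
```

The same pattern extends to dynamic path planning: feed the nearest candidates into a routing algorithm rather than scanning all couriers.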

2. Finance

The financial industry is another generator of massive data. Financial industry data mainly includes user transaction, enterprise transaction, and securities data. To make matters worse, most transactions in the financial industry require real-time computing and a wide range of different statistical analysis functions.

Additionally, data collected in the financial industry must be kept secure. I have shared an analysis on the needs of a securities trading simulation system. If you are interested, refer to Analysis and application of PostgreSQL in databases in the securities industry (article in Chinese).

3. Internet of Things (IoT)

Data generated by IoT devices is sequential, and therefore requires StreamCompute (e.g. alerts triggered when a threshold is reached) as well as post-event analysis. As the data size is huge, it also needs to be compressed.

Another feature of IoT is that data reported by sensors often contains numerical ranges (e.g. a temperature range), geographic positions, images, etc. How can we effectively store and search such varied data types? I have also analyzed the features of databases for IoT applications in a few articles (articles in Chinese).

Rise of StreamCompute - PostgreSQL joins hands with PipelineDB to support IoT

Implementation of the revolving door data compression algorithm in PostgreSQL - application of streaming compression in the IoT, monitoring, and sensor scenarios

The cutting-edge technology of PostgreSQL for IoT - creating an index (BRIN Index) a fraction of the normal size

A simple algorithm to help IoT and financial users cut data storage costs by 98%

Streaming application on IoT - real-time processing (trillions of data entries per day) with PostgreSQL

PostgreSQL cutting-edge technology - range type and Gist indexes boost development in IoT
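As a minimal sketch of the range-type approach mentioned above (the table, columns, and alert band are hypothetical), a numrange column with a GiST index lets us store and search the numerical ranges reported by sensors:

```sql
-- Hypothetical sensor readings, each reporting a measured temperature range.
CREATE TABLE sensor_report (
    sensor_id  int,
    reported   timestamptz,
    temp_range numrange          -- e.g. '[18.5,19.2]'
);
CREATE INDEX ON sensor_report USING gist (temp_range);

-- Reports whose temperature range overlaps the alert band of 35-40 degrees.
-- The && (overlap) operator is accelerated by the GiST index.
SELECT sensor_id, reported
FROM sensor_report
WHERE temp_range && numrange(35, 40);
```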

4. Monitoring

Monitoring involves overseeing business operations and IT infrastructure statuses, such as server status, network status, and storage status. Monitoring is used considerably in all kinds of businesses, but monitoring needs and data types vary across industries.

For example, some industries may need positional monitoring, such as tracking the location of a bus and sending alerts when it leaves a geofence or when the driver exhibits unusual driving behavior.

5. Public security

Public security data comes from various fields, including communication records and travel records.

Public security data can be incredibly large. Typical use cases include risk control and capturing criminal suspects, which rely heavily on analyzing chronological and geographic positional information, as well as image searches.

How can these needs be met?

Industry pain points

The challenges of running businesses with massive amounts of data can be summarized with this list of questions:

  1. How can we solve the issue of data islands and open up data sharing channels?
  2. How can we efficiently write logs, behavioral tracks, financial data, and trajectory data?
  3. How can we quickly and efficiently process data, send threshold alerts, and conduct real-time analysis?
  4. How can we solve the disaster tolerance and backup problems faced by big data?
  5. How can we solve the problems of compression and efficiency for big data?
  6. How can we solve the problems imposed by multi-dimensional data, diversified data types, and complex computing?
  7. How can we help companies improve the scalability of their IT architectures?

Solutions

We will cover three components:

1.RDS PostgreSQL

RDS PostgreSQL supports chronological data, block range indexes, inverted indexes, multi-core parallel computing, JSON, array storage, OSS_FDW external read/write, and other features.

RDS PostgreSQL solves the problems of OLTP, GIS application, complex query, spatial data processing, multi-dimensional analysis, and cold/hot data separation.
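As a small illustration of the block range index (BRIN) feature mentioned above (the vehicle_track table is hypothetical), a BRIN index on an append-only, time-ordered table occupies a fraction of the space of a B-tree while still accelerating time-range scans:

```sql
-- Hypothetical append-only tracking table, written in timestamp order.
CREATE TABLE vehicle_track (
    vehicle_id int,
    ts         timestamptz,
    pos        point
);

-- A BRIN index stores only per-block min/max summaries, so it is tiny
-- compared to a B-tree on the same column.
CREATE INDEX ON vehicle_track USING brin (ts);

-- Time-range scans use the BRIN summaries to skip irrelevant blocks.
SELECT count(*)
FROM vehicle_track
WHERE ts BETWEEN now() - interval '1 hour' AND now();
```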

2.HybridDB PostgreSQL

HybridDB PostgreSQL supports array storage, horizontal expansion, block compression, extensive data types, a machine learning library (MADlib), PL/Python, PL/Java, and PL/R programming, OSS_FDW external read and write, etc.

It completely solves the issue of computing massive amounts of data.

3.Object Storage Service (OSS)

Data can be shared among multiple RDS instances through OSS_FDW. OSS supports multiple replicas and cross-region replication.

OSS solves the problems of data islands, storing massive data, cross-data-center disaster tolerance, massive data backup, etc.

1. Writing

Data is written in three ways:

  1. Online real-time writing implemented through the RDS SQL interface, which allows a single instance to achieve a writing speed of more than 1 million lines/second.
  2. Batch quasi real-time writing implemented through the HybridDB SQL interface, which allows a single instance to achieve a writing speed of more than 1 million lines/second.
  3. Batch quasi real-time writing, e.g. writing to a file, implemented through the OSS interface, which supports bandwidth auto scaling.
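As a sketch of the first two writing paths (the package_tracking table and file path are hypothetical placeholders), a multi-row INSERT suits online real-time writes, while COPY is typically the fastest bulk path into PostgreSQL:

```sql
-- Online real-time writing: one round trip inserts many rows.
INSERT INTO package_tracking (package_id, checkpoint, scanned_at)
VALUES (1001, 'HUB-A', now()),
       (1002, 'HUB-A', now()),
       (1003, 'HUB-B', now());

-- Batch quasi real-time writing: COPY bypasses per-row parsing overhead
-- and is usually the fastest bulk-load path.
COPY package_tracking (package_id, checkpoint, scanned_at)
FROM '/path/to/tracking.csv' WITH (FORMAT csv);
```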

2. Sharing

Data can be shared among multiple RDS instances through OSS_FDW.

Let's assume we have two enterprises, each with its own RDS database instance, and some data needs to be shared between them. The traditional method requires ETL, but with OSS_FDW we can efficiently share data among multiple instances.

The OSS_FDW parallel read/write function of RDS PostgreSQL (which enables multiple worker processes to read from and write to the same table file) provides highly efficient access to shared data.

Parallelism appears at three levels: parallel OSS read/write, multi-core parallel computing within a single RDS PostgreSQL instance, and multi-machine parallel computing across RDS PostgreSQL or HybridDB nodes.
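A minimal sketch of sharing a table through OSS_FDW follows. The server and table option names here follow the oss_fdw articles listed in the References section, and the endpoint, bucket, credentials, and directory values are placeholders, so check that documentation for the exact syntax:

```sql
-- Sketch only: all OPTIONS values are placeholders.
CREATE EXTENSION IF NOT EXISTS oss_fdw;

CREATE SERVER oss_srv
    FOREIGN DATA WRAPPER oss_fdw
    OPTIONS (host 'oss-cn-hangzhou.aliyuncs.com',
             bucket 'shared-bucket',
             id '<access_key_id>',
             key '<access_key_secret>');

-- Each RDS instance maps the same OSS directory as a foreign table.
CREATE FOREIGN TABLE shared_log (
    id   int8,
    info text,
    ts   timestamptz
) SERVER oss_srv
  OPTIONS (dir 'log/');

-- Multiple instances can now query the same OSS-resident data directly,
-- with no ETL pipeline between them.
SELECT count(*) FROM shared_log;
```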

3. Storage

Local RDS PostgreSQL and HybridDB are used for real time data storage. When stored data needs to be analyzed or shared, we store it in OSS.

Compared with keeping the same data on the computing nodes, OSS storage is much cheaper, which reduces the company's IT costs without sacrificing flexibility.

OSS solves enterprises' problems of data redundancy and high cost, and meets their requirements for data backup and disaster tolerance.

4. Computing

By using RDS PostgreSQL, HybridDB, and OSS, we can separate our computing resources and storage resources.

Since less data is stored at the computing nodes (most of the data is stored in OSS), scaling the computing nodes up or down, as well as disaster tolerance and backup, becomes much easier.

Computing methods may be divided into the following types:

1.Stream computing

StreamCompute can be divided into two sub-types: real-time computing and real-time threshold alerts.

Both can be achieved through PipelineDB (based on PostgreSQL).
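A minimal sketch using standalone PipelineDB 0.9.x syntax follows (the stream and column names are made up, and the syntax may change once PipelineDB ships as a PostgreSQL extension):

```sql
-- Events arrive on a stream rather than a table.
CREATE STREAM temp_stream (sensor_id int, temp numeric);

-- A continuous view is incrementally maintained as events arrive;
-- raw events are discarded after they are aggregated.
CREATE CONTINUOUS VIEW sensor_avg AS
    SELECT sensor_id, avg(temp), count(*)
    FROM temp_stream
    GROUP BY sensor_id;

-- Events are pushed with ordinary INSERTs.
INSERT INTO temp_stream (sensor_id, temp) VALUES (1, 21.5), (2, 36.8);

-- Reading the view returns the up-to-date aggregates.
SELECT * FROM sensor_avg;
```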

Benefits:

The standard SQL interface and a wide range of built-in functions support complex StreamCompute demands. Diversified data types (including GIS, JSON, etc.) support even more StreamCompute business scenarios. The asynchronous message mechanism supports the second sub-type, real-time threshold alerts.

PipelineDB is currently being refactored, and may become available as a PostgreSQL extension in the future.

https://github.com/pipelinedb/pipelinedb/issues?q=is%3Aissue+is%3Aopen+label%3A%22extension+refactor%22

Taking the monitoring industry as an example, the asynchronous message mechanism of StreamCompute effectively avoids the inefficiency of traditional polling-based monitoring methods.

2.Real-time interaction

We can use RDS PostgreSQL to meet traditional OLTP demands.

Features of PostgreSQL include: GIS, JSON, array, cold/hot data separation, horizontal database partitioning, K-V type, multi-core parallel computing, BRIN indexes, GIN indexes, etc.

Scenarios supported by PostgreSQL include: StreamCompute, image searches, chronological data, path planning, fuzzy searches, full-text searches, similarity searches, flash sales, genetics, finance, chemistry, GIS application, complex searches, BI, multi-dimensional analysis, spatial data searches, etc.
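As one concrete example of the fuzzy and similarity search scenarios above (the article table is hypothetical), the pg_trgm extension provides trigram-based similarity matching backed by a GIN index:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE article (
    id    serial PRIMARY KEY,
    title text
);
-- A trigram GIN index accelerates both similarity (%) and LIKE searches.
CREATE INDEX ON article USING gin (title gin_trgm_ops);

-- % matches rows above the similarity threshold; similarity() ranks them.
SELECT id, title, similarity(title, 'postgres stream') AS score
FROM article
WHERE title % 'postgres stream'
ORDER BY score DESC
LIMIT 10;
```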

It covers a wide range of industries, including banking, insurance, securities, IoT, Internet, gaming, astronomy, publishing, e-commerce, traditional enterprises, etc.

3.Quasi real-time analysis

Used in combination with OSS, both RDS PostgreSQL and HybridDB can achieve quasi real-time analysis.

The same OSS data can also be shared and concurrently accessed by multiple instances.

4.Offline data analysis and data mining

Used in combination with OSS, both RDS PostgreSQL and HybridDB can be used to analyze and mine offline data.

RDS PostgreSQL supports single computer multi-core parallel computing, and HybridDB for PostgreSQL supports multi-computer multi-core parallel computing. Users can make the choice based on computing capacity.
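A quick way to see single-machine multi-core parallelism at work (PostgreSQL 9.6 and later; the table name is hypothetical) is to raise the per-query worker limit and inspect the plan:

```sql
-- Allow up to 4 parallel workers per Gather node for this session.
SET max_parallel_workers_per_gather = 4;

-- On a sufficiently large table, the plan shows a Gather node with
-- parallel workers performing the scan and partial aggregation.
EXPLAIN (COSTS OFF)
SELECT checkpoint, count(*)
FROM package_tracking
GROUP BY checkpoint;
```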

Required computing capacities

At the heart of computing lies support for diverse data types.

  1. PostgreSQL supports a wide range of different built-in data types, including numbers, strings, time, Boolean, enumeration, array, range, GIS, full-text search, bytea, large object, geometry, bit, XML, UUID, JSON, composite types, etc., as well as data types customized by users. PostgreSQL supports almost all business scenarios.
  2. Operators. In order to satisfy data processing needs, PostgreSQL offers a wealth of operators for each of the supported data types.
  3. PostgreSQL has extensive built-in functions, including statistical, trigonometric, GIS, and MADlib machine learning functions.
  4. Customized computing logic. Users can use a variety of programming languages, including C, Python, Java, and R, to write custom data handling functions and expand the data handling capacity of PostgreSQL and HybridDB for PostgreSQL.
  5. There are extensive built-in aggregate functions to support statistical analysis.
  6. Window query functions.
  7. Recursive queries.
  8. Multi-dimensional analytical syntax.
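Items 6 and 8 above can be sketched with a hypothetical txn table: a window function computes a per-account running total, and GROUPING SETS performs multi-dimensional aggregation in a single pass:

```sql
-- Window function: running total of transactions per account, ordered by time.
SELECT account_id, ts, amount,
       sum(amount) OVER (PARTITION BY account_id ORDER BY ts) AS running_total
FROM txn;

-- Multi-dimensional analysis: per (region, product), per region,
-- and a grand total, all in one scan.
SELECT region, product, sum(amount)
FROM txn
GROUP BY GROUPING SETS ((region, product), (region), ());
```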

Summary

Advantages of RDS PostgreSQL

The advantages of RDS PostgreSQL are mainly reflected in the following aspects:

1.Performance

RDS PostgreSQL mainly handles online transactions and a small amount of quasi real-time analysis.

Performance of PostgreSQL OLTP is at the level of commercial databases. For details, refer to this article:

Performance evaluation in the database industry_tpc.org

Multi-core parallel computing, JIT, operator reuse, and other features allow PostgreSQL to leave other relational databases far behind in terms of OLAP capacity. For more on the OLAP analytical capacity of PostgreSQL, refer to this article:

Analysis of the cutting-edge technology of accelerator engines - open the treasure chest of PostgreSQL with the combination of LLVM, array storage, multi-core parallel computing, and operator reuse

PostgreSQL 10 also brings many enhancements in terms of HTAP.

2.Features

Features are a major strength of PostgreSQL. You can find a detailed description in the previous section "Required computing capacities".

3.Scalability

The performance of complex computing can be enhanced by adding additional CPUs.

The storage capacity of RDS PG can be extended beyond the instance's own limit by using OSS storage together with the OSS_FDW extension.

4.Cost

As most of the separable cold data is stored in OSS, users do not need to worry about disaster tolerance or backups for this part of the data. Compared to in-database storage, storage costs are greatly reduced.

Both RDS PG and HybridDB PG support standard SQL interfaces, which can also be used to access data stored in OSS (through the foreign table interface). All of these features combined lead to significant reductions in development costs.

Maintenance cost: There is almost no O&M cost when using the cloud services.

5.Industries covered

It covers a wide range of industries, including banking, insurance, securities, IoT, Internet, gaming, astronomy, publishing, e-commerce, traditional enterprises, etc.

Advantages of HybridDB PostgreSQL

1.Performance

HybridDB PostgreSQL is an MPP architecture, and therefore has outstanding computing performance.

2.Features

You can find a detailed description in the previous section "Required computing capacities".

3.Scalability

The performance of complex computing can be enhanced by adding additional computing nodes.

The storage capacity of HybridDB PG can be extended beyond the instance's own limit by using OSS storage together with the OSS_FDW extension.

4.Cost

As most of the separable cold data is stored in OSS, users do not need to worry about disaster tolerance or backups for this part of the data. Compared to in-database storage, storage costs are greatly reduced.

Both RDS PG and HybridDB PG support standard SQL interfaces, which can also be used to access data stored in OSS (through the foreign table interface). All of these features combined lead to significant reductions in development costs.

Maintenance cost: There is almost no O&M cost when using the cloud services.

5.Industries covered

It covers a wide range of industries, including banking, insurance, securities, IoT, Internet, gaming, astronomy, publishing, e-commerce, traditional enterprises, etc.

References:

RDS PostgreSQL: use oss_fdw to read/write OSS

HybridDB PostgreSQL: use oss_fdw to read/write OSS
