NoSQL Databases - Key-/Value-Stores

Key-/value-stores have a simple data model in common: a map/dictionary, allowing clients to put and request values per key. 
Besides the data model and the API, modern key-/value-stores favor high scalability over consistency, and therefore most of them also omit rich ad-hoc querying and analytics features (joins and aggregate operations in particular are set aside).
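
To make this data model concrete, here is a minimal sketch of such a map/dictionary interface. This is illustrative only; real key-/value-stores add replication, persistence and versioning behind the same few operations.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the key-/value data model: a map/dictionary that lets
// clients put and request values per key. Real stores layer replication,
// persistence and versioning behind this same small interface.
public class SimpleKeyValueStore<K, V> {
    private final Map<K, V> map = new ConcurrentHashMap<>();

    public void put(K key, V value) { map.put(key, value); }
    public V get(K key)             { return map.get(key); }
    public void delete(K key)       { map.remove(key); }
}
```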

Key-/value-stores have existed for a long time (e. g. Berkeley DB [Ora10d]), but a large number of the NoSQL stores in this class that have emerged in the last couple of years have been heavily influenced by Amazon’s Dynamo, which will be investigated thoroughly in this chapter. Among the great variety of free and open-source key-/value-stores, Project Voldemort will be further examined. At the end of the chapter some other notable key-/value-stores will be briefly looked at.

This is arguably the simplest kind of NoSQL store: it emphasizes high scalability, while consistency and query/analytics features are comparatively weak. Databases of this type appeared long ago, like Berkeley DB, but back then data volumes were not that huge, so the advantage of such a simple model was not apparent.

It was not until Amazon’s Dynamo that this category really took off...

 

4.1. Amazon’s Dynamo

Amazon Dynamo is one of several databases used at Amazon for different purposes (others are e. g. SimpleDB or S3, the Simple Storage Service, cf. [Ama10b], [Ama10a]).

DeCandia et al. note their contribution to the research community is that Dynamo as an “eventually consistent storage system can be used in production with demanding applications”.

Best to just read the paper directly...

http://www.cnblogs.com/fxjwind/archive/2012/06/26/2563628.html

The two points below are summarized well; notes follow.

4.1.4. Implementation and Optimizations

In addition to the system design, DeCandia et al. make concrete remarks on the implementation of Dynamo. Based on experience at Amazon, they also provide suggestions for its optimization (cf. [DHJ+07, p. 213f]):

  • The code running on a Dynamo node consists of a request coordination, a membership and a failure detection component, each of them implemented in Java.
  • Dynamo provides pluggable persistence components (e. g. Berkeley Database Transactional Data Store and BDB Java Edition [Ora10d], MySQL [Ora10c], an in-memory buffer with a persistent backing store). Experience from Amazon shows that “[applications] choose Dynamo’s local persistence engine based on their object size distribution”.
  • The component in charge of coordinating requests is implemented “on top of an event-driven messaging substrate where the message processing pipeline is split into multiple stages”. Internode communication employs Java NIO channels (see [Ora10b]).
  • When a client issues a read or write request, the contacted node will become its coordinator. It then creates a state machine that “contains all the logic for identifying the nodes responsible for a key, sending the requests, waiting for responses, potentially doing retries, processing the replies and packaging the response to the client. Each state machine instance handles exactly one client request.”
  • If a request coordinator receives a stale version of data from a node, it will update it with the latest version. This is called read-repair “because it repairs replicas that have missed a recent update at an opportunistic time and relieves the anti-entropy protocol from having to do it” (a sketch of this idea follows after this list).
  • To achieve an even load distribution, write requests can be addressed to “any of the top N nodes in the preference list”.
  • As an optimization that reduces latency for both read and write requests, Dynamo allows client applications to act as coordinator of these operations.
  • As write requests usually follow read requests, Dynamo pursues a further optimization: in a read response, the storage node that replied fastest to the read coordinator is transmitted to the client; in a subsequent write request, the client will contact this node. “This optimization enables us to pick the node that has the data that was read by the preceding read operation thereby increasing the chances of getting “read-your-writes” consistency”.
  • A further optimization to reach Amazon’s desired performance in the 99.9th percentile is the introduction of an object buffer in the main memory of storage nodes. In this buffer, write requests are queued and persisted to disk periodically by a writer thread. Read operations have to examine both the object buffer and the persisted data of storage nodes. Using an in-memory buffer that is persisted asynchronously can result in data loss when a server crashes. To reduce this risk, the coordinator of the write request will choose one particular replica node to perform a durable write for the data item. This durable write does not impact the performance of the request, as the coordinator only waits for W − 1 nodes to answer before responding to the client (cf. [DHJ+07, p. 215]).
  • Dynamo nodes do not only serve client requests but also perform a number of background tasks. A component named admission controller is in charge of dividing resources between request processing and background tasks; it “constantly monitors the behavior of resource accesses while executing a “foreground” put/get operation”. Monitoring includes disk I/O latencies, lock contention and transaction timeouts resulting in failed database accesses, as well as waiting periods in the request queue. Via this monitoring feedback, the admission controller decides on the number of time-slices for resource access or consumption to be given to background tasks. It also coordinates the execution of background tasks, which have to explicitly apply for resources with the admission controller (cf. [DHJ+07, p. 218]).
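
The read-repair mechanism from the list above can be sketched as follows. This is a minimal illustration, not Dynamo’s actual code: a plain version counter stands in for Dynamo’s vector clocks, and the Replica interface and all names are hypothetical.

```java
import java.util.*;

// Hypothetical read-repair sketch: the coordinator reads from several
// replicas, picks the newest version, and opportunistically writes it back
// to any replica that returned a stale copy. Dynamo compares versions with
// vector clocks; a plain counter is used here to keep the example short.
class ReadRepairCoordinator {

    // A value together with its version number (simplification of a vector clock).
    record Versioned(String value, long version) {}

    // Hypothetical replica interface, not Dynamo's real API.
    interface Replica {
        Versioned get(String key);
        void put(String key, Versioned v);
    }

    private final List<Replica> replicas;

    ReadRepairCoordinator(List<Replica> replicas) {
        this.replicas = replicas;
    }

    // Read from all replicas, return the newest value, and repair replicas
    // that missed a recent update.
    String get(String key) {
        Map<Replica, Versioned> replies = new HashMap<>();
        Versioned newest = null;
        for (Replica r : replicas) {
            Versioned v = r.get(key);
            if (v == null) continue;
            replies.put(r, v);
            if (newest == null || v.version() > newest.version()) newest = v;
        }
        if (newest == null) return null;
        for (Map.Entry<Replica, Versioned> e : replies.entrySet()) {
            if (e.getValue().version() < newest.version()) {
                e.getKey().put(key, newest);  // read-repair the stale replica
            }
        }
        return newest.value();
    }
}
```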

 

4.1.5. Evaluation

To conclude the discussions on Amazon’s Dynamo, a brief evaluation taken from Bob Ippolito’s talk “Drop ACID and think about Data” shall be presented in table 4.2.

Table 4.2.: Advantages and disadvantages of Amazon’s Dynamo (according to Ippolito)

Advantages:
  • "No master"
  • "Highly available for write" operations
  • "Knobs for tuning reads" (as well as writes)
  • "Simple"

Disadvantages:
  • "Proprietary to Amazon"
  • "Clients need to be smart" (support vector clocks and conflict resolution, balance clusters)
  • "No compression" (client applications may compress values themselves, keys cannot be compressed)
  • "Not suitable for column-like workloads"
  • "Just a Key/Value store" (e. g. range queries or batch operations are not possible)

 

4.2. Project Voldemort

Project Voldemort is a key-/value-store initially developed for and still used at LinkedIn.

http://project-voldemort.com/design.php

http://www.54chen.com/document/dynamo-based-systems.html (Chinese translation)

The design document does not contain much that is substantial or novel; it does not even explain why they built this system in the first place. Would using Dynamo directly not have worked?

In terms of design, a layered architecture is adopted. Keeping each layer independent means layers can be mixed and matched to meet different operational needs. For example, a compression layer can be inserted that compresses byte values below the serialization level. Likewise, flexible, intelligent routing can be applied when routing data to partitions.
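
The compression-layer example just mentioned might look like the following sketch: a decorator that wraps a hypothetical Store interface and compresses values below the serialization level using java.util.zip. This illustrates the layering idea; it is not Voldemort’s actual code.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical byte-oriented store interface (not Voldemort's real API).
interface Store {
    byte[] get(byte[] key);
    void put(byte[] key, byte[] value);
}

// A compression layer that can be mixed into the stack: it wraps any
// underlying Store and transparently compresses/decompresses values.
class CompressingStore implements Store {
    private final Store inner;

    CompressingStore(Store inner) { this.inner = inner; }

    public void put(byte[] key, byte[] value) {
        inner.put(key, compress(value));
    }

    public byte[] get(byte[] key) {
        byte[] v = inner.get(key);
        return v == null ? null : decompress(v);
    }

    private static byte[] compress(byte[] data) {
        Deflater d = new Deflater();
        d.setInput(data);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) out.write(buf, 0, d.deflate(buf));
        return out.toByteArray();
    }

    private static byte[] decompress(byte[] data) {
        try {
            Inflater i = new Inflater();
            i.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!i.finished()) out.write(buf, 0, i.inflate(buf));
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed value", e);
        }
    }
}
```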

The following figure depicts options for the physical deployment of Project Voldemort, with a focus on routing and load-balancing.

The trade-off depicted in the figure is reduced latency by minimizing network hops vs. stronger coupling as routing logic moves up the software stack towards the frontend.

 

4.2.3. Versioning and Consistency

Like Amazon’s Dynamo, Project Voldemort is designed to be highly available for write operations, allows concurrent modifications of data, and uses vector clocks to allow causal reasoning about different versions (see sections 3.1.3 and 4.1.3). If the datastore itself cannot resolve version conflicts, client applications are asked to resolve them at read time.
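
How vector clocks enable causal reasoning can be sketched as follows: two versions conflict exactly when neither clock descends from the other, and such concurrent versions are the ones handed to the client for resolution at read time. Class and method names here are illustrative, not Voldemort’s actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative vector clock (not Voldemort's actual class): one counter per
// node that has coordinated a write to the value.
class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    // A node increments its own counter when it coordinates a write.
    void increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    // True if this clock is >= `other` on every entry, i.e. this version
    // causally succeeds (or equals) the other one.
    boolean descendsFrom(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet()) {
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        }
        return true;
    }

    // Concurrent versions are conflicts the datastore cannot resolve itself;
    // they are returned to the client application at read time.
    boolean concurrentWith(VectorClock other) {
        return !this.descendsFrom(other) && !other.descendsFrom(this);
    }
}
```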

 

4.2.4. Persistence Layer and Storage Engines 
Project Voldemort provides pluggable persistence: the lowest layer of the logical architecture allows for different storage engines. Out of the box, Berkeley DB (the default storage engine), MySQL, as well as in-memory storage are delivered with Project Voldemort.

To use another existing or self-written storage engine, the operations get, put and delete, along with an iterator over values, have to be implemented.

It is easily extensible: you only need to implement the get, put and delete operations (plus an iterator), as sketched below.
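
A sketch of that contract (an illustrative interface, not Voldemort’s actual StorageEngine API, which additionally deals with versioned byte-array keys and values):

```java
import java.util.Iterator;
import java.util.Map;

// Illustrative pluggable storage engine contract: get, put, delete plus an
// iterator over stored entries, as required of custom Voldemort engines.
interface StorageEngine<K, V> {
    V get(K key);
    void put(K key, V value);
    void delete(K key);
    Iterator<Map.Entry<K, V>> entries();
}
```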

 

4.2.5. Data Model and Serialization 
On the lowest level, keys and values in Project Voldemort are simply byte-arrays. In order to give applications a more sophisticated notion of keys and values, data models can be configured for each Voldemort store. These data models define serializers that are responsible for converting byte-arrays into the desired data structures and formats. Project Voldemort already contains serializers for the following data structures and formats:

  • JSON (JavaScript Object Notation), which Voldemort stores in a binary, typed format supporting the data types list, map, date, boolean as well as numbers of different precision (cf. [Cro06]).
  • String, to store uninterpreted strings, which can also be used for XML blobs.
  • Java Serialization, provided by Java classes implementing the java.io.Serializable interface (cf. [Ora10a], [Blo01, p. 213ff]).
  • Protocol Buffers, “Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data”, which also contains an interface description language to generate code for custom data interchange. Protocol Buffers are widely used at Google “for almost all of its internal RPC protocols and file formats” (cf. [Goo10a], [Goo10b]).
  • Identity, which does no serialization or deserialization at all but simply hands over byte-arrays.
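
The serializer concept boils down to a pair of conversions between byte-arrays and application-level objects. A sketch, with illustrative interface names:

```java
// Illustrative serializer contract: convert between the byte-arrays stored
// on disk and the data structures seen by the application.
interface Serializer<T> {
    byte[] toBytes(T object);
    T toObject(byte[] bytes);
}

// The identity serializer simply hands byte-arrays through unchanged.
class IdentitySerializer implements Serializer<byte[]> {
    public byte[] toBytes(byte[] object) { return object; }
    public byte[] toObject(byte[] bytes) { return bytes; }
}
```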

 

4.2.6. Index Precalculation 
Project Voldemort allows indexes to be prebuilt offline (e. g. on Hadoop), uploaded to the datastore and transparently swapped in. This is especially useful for batch operations that insert or update large amounts of data, which would cause index rebuilds if the data were uploaded as a whole, or index fragmentation if it were inefficiently inserted in small portions. In Project Voldemort, large amounts of data can be inserted as a whole, and the offline index building feature disburdens live systems from full index rebuilds or index fragmentation.

 

4.3. Other Key-/Value-Stores

4.3.1. Tokyo Cabinet and Tokyo Tyrant

Tokyo Cabinet ([FAL10a], http://fallabs.com/tokyocabinet/) is the core library of this datastore; it persists data and exposes a key-/value-interface to clients, thereby abstracting from internal data structures such as hash-tables or B+tree-indexes.

Tokyo Tyrant ([FAL10b], http://fallabs.com/tokyotyrant/) provides network access to this database library, which is possible via a proprietary binary protocol, HTTP, as well as through the memcached protocol.

Tokyo Cabinet manages data storage on disk and in memory in a fashion similar to paging/swapping. Memory pages are flushed to disk periodically, “which leaves an open data loss hole”, as Hoff comments (cf. [Hof09a]).

And the winner is: MySQL or Memcached or Tokyo Tyrant?, http://highscalability.com/blog/2009/10/28/and-the-winner-is-mysql-or-memcached-or-tokyo-tyrant.html

Tokyo Tyrant is a disk based key-value store. Tyrant was nearly 2x faster than MySQL + memcached and about 20% slower than a memcached only solution. Keep in mind one concern about Tyrant is that it is not distributed, it's not a scale-out solution, so it will face sharding issues similar to MySQL. It also flushes data to disk periodically which leaves an open data loss hole and variable performance during syncs.

 

Tokyo Cabinet allows pages to be compressed with the LZW algorithm, which can achieve good compression ratios (see e. g. [Lin09]).

Tokyo Cabinet does not partition data automatically and therefore has to be replicated with a strategy similar to MySQL. In addition to lookups by key, it can match prefixes and ranges if keys are ordered, as illustrated below.
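
The idea of prefix and range matching over ordered keys can be illustrated with a plain TreeMap; this stands in for Tokyo Cabinet’s B+tree engine and is not the Tokyo Cabinet API:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Prefix/range matching over lexicographically ordered keys: all keys
// sharing a prefix form one contiguous range in the sorted index.
public class OrderedKeyMatching {
    public static void main(String[] args) {
        TreeMap<String, String> index = new TreeMap<>();
        index.put("user:alice", "1");
        index.put("user:bob", "2");
        index.put("video:cats", "3");

        // every key starting with "user:" lies in this half-open range
        SortedMap<String, String> users =
            index.subMap("user:", "user:" + Character.MAX_VALUE);
        System.out.println(users.keySet()); // [user:alice, user:bob]
    }
}
```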

Regarding transactional features, it is worth mentioning that Tokyo Cabinet provides write-ahead logging and shadow paging. The Tokyo suite is actively developed, well documented and widely regarded as highly performant: 1 million records can be stored in 0.7 seconds using the hash-table engine and in 1.6 seconds using the B-tree engine, according to North (cf. [Nor09], [Ipp09], [See09], [Lin09], [Hof09a]).

 

4.3.2. Redis 
Redis is a relatively new datastore whose developers, somewhat unspecifically, refer to it as a “data structure store”; it is commonly subsumed under key-/value-stores because of its map/dictionary API. What is special about Redis is that it allows matching on key ranges, e. g. matching of numeric ranges or regular expressions. In contrast to other key-/value-stores, Redis does not only store bytes as values but directly supports lists and sets as value types.
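
The direct support for lists and sets can be demonstrated with the Jedis client for Java (assuming a Redis server on localhost:6379; the key names are made up):

```java
import java.util.List;
import java.util.Set;
import redis.clients.jedis.Jedis;

// Redis stores lists and sets natively instead of as opaque byte values,
// so elements can be manipulated and queried server-side.
public class RedisStructures {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // a list value: elements are pushed and read back in order
            jedis.lpush("recent:logins", "alice", "bob");
            List<String> logins = jedis.lrange("recent:logins", 0, -1);

            // a set value: duplicates are ignored, membership tests are cheap
            jedis.sadd("tags:article:42", "nosql", "key-value", "nosql");
            Set<String> tags = jedis.smembers("tags:article:42");

            System.out.println(logins + " / " + tags);
        }
    }
}
```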

A major disadvantage of Redis is that the amount of main memory limits the amount of data that can be stored; this cannot be expanded by the use of hard disks. Ippolito comments that for this reason Redis is probably a good fit as a caching layer (cf. [Ipp09]).

 

4.3.3. Memcached and MemcacheDB 
Memcached is the popular and fast memory-caching solution widely used among large and ultra-large-scale web sites to reduce database load.

Though memcached is not intended for persistent storage (cf. [F+10b]), there is a solution named MemcacheDB (cf. [Chu09]) that conforms to the memcached protocol (cf. [F+10c]) and adds persistence based on Berkeley DB (cf. [Nor09], [Ipp09]).
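
Because MemcacheDB speaks the memcached protocol, an ordinary memcached client can be pointed at it. A sketch using the spymemcached client; host and port are placeholders (21201 is assumed here as MemcacheDB’s default port):

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

// Talking to MemcacheDB through a standard memcached client: the protocol
// is the same, only the server behind it persists to Berkeley DB.
public class MemcacheDbExample {
    public static void main(String[] args) throws Exception {
        MemcachedClient client =
            new MemcachedClient(new InetSocketAddress("localhost", 21201));

        // set(key, expiration-in-seconds, value); 0 means "do not expire"
        client.set("greeting", 0, "hello, persistent memcached");
        Object value = client.get("greeting");
        System.out.println(value);

        client.shutdown();
    }
}
```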

As memcached does not provide any replication between nodes and is also not tolerant of machine failures, simply adding persistent storage to it is not enough, as blogger Richard Jones of Last.fm remarks. He notes that solutions like repcached can replicate whole memcached servers (in a master/slave setup), but without fault-tolerant partitioning they will cause management and maintenance effort (cf. [Jon09]).

 

4.3.4. Scalaris 
Scalaris is a key-/value-store written in Erlang that takes advantage of this language’s strengths, e. g. in implementing a non-blocking commit protocol for atomic transactions.

Scalaris uses an adapted version of the Chord service (cf. [SMK+01]) to expose a distributed hash table to clients. As it stores keys in lexicographic order, range queries on prefixes are possible.

In contrast to other key-/value-stores, Scalaris has a strict consistency model, provides symmetric replication and allows for complex queries (via programming-language libraries). It guarantees ACID properties even for concurrent transactions by implementing an adapted version of the Paxos consensus protocol (cf. [Lam98]). Like memcached, Scalaris is a pure in-memory key-/value-store (cf. [Nor09], [Jon09]).


This article is excerpted from cnblogs; original publication date: 2012-06-27.
