HBase-TDG Client API: The Basics

Introduction:

table.delete(delete);
table.close();

Atomic Compare-and-Delete

You have seen in the section called “Atomic Compare-and-Set” how to use an atomic, conditional operation to insert data into a table. There is an equivalent call for deletes that gives you access to server-side, read-and-modify functionality:

boolean checkAndDelete(byte[] row, byte[] family, byte[] qualifier, byte[] value, Delete delete) throws IOException

See checkAndPut; the principle is the same.
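A minimal sketch of how this could look, mirroring the checkAndPut pattern from the earlier section; the table handle and the row/column/value names are assumptions for illustration. The delete is applied only when the checked column currently holds the expected value; the return value tells you whether it was applied.

// assumes: import org.apache.hadoop.hbase.client.*; import org.apache.hadoop.hbase.util.Bytes;
// assumes an already opened HTable instance named "table"
Delete delete = new Delete(Bytes.toBytes("row1"));
delete.deleteColumns(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));

// only delete if "colfam1:qual1" in "row1" currently holds "val1"
boolean applied = table.checkAndDelete(Bytes.toBytes("row1"),
    Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val1"), delete);
System.out.println("Delete applied: " + applied);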

Batch Operations

You have seen how you can add, retrieve, and remove data from a table using single or list-based operations. In this section we will look at API calls to batch different operations across multiple rows.
In fact, all the list-based put, get, and delete operations we saw earlier are implemented on top of batch.

private final static byte[] ROW1 = Bytes.toBytes("row1");
private final static byte[] ROW2 = Bytes.toBytes("row2");
private final static byte[] COLFAM1 = Bytes.toBytes("colfam1");
private final static byte[] COLFAM2 = Bytes.toBytes("colfam2");
private final static byte[] QUAL1 = Bytes.toBytes("qual1");
private final static byte[] QUAL2 = Bytes.toBytes("qual2");
List<Row> batch = new ArrayList<Row>();    // the batch list holds Put, Get, and Delete instances

Put put = new Put(ROW2);
put.add(COLFAM2, QUAL1, Bytes.toBytes("val5"));
batch.add(put);                            // insert one column into row2

Get get1 = new Get(ROW1);
get1.addColumn(COLFAM1, QUAL1);
batch.add(get1);                           // read one column from row1

Delete delete = new Delete(ROW1);
delete.deleteColumns(COLFAM1, QUAL2);
batch.add(delete);                         // delete one column from row1

Get get2 = new Get(ROW2);
get2.addFamily(Bytes.toBytes("BOGUS"));
batch.add(get2);                           // fails: column family BOGUS does not exist

Object[] results = new Object[batch.size()];   // one result slot per operation, filled even if some fail
try {
    table.batch(batch, results);
} catch (Exception e) {
    System.err.println("Error: " + e);
}
for (int i = 0; i < results.length; i++) {
    System.out.println("Result[" + i + "]: " + results[i]);
}

A simple example: Put, Get, and Delete operations on different rows can all be handled in a single batch.
Note that you should not put a Put and a Delete for the same row into the same batch. Also, batch operations do not use the client-side write buffer; they are sent directly to the servers.

Be aware that you should not mix a Delete and a Put operation for the same row in one batch call. The operations will be applied in a different order, one that guarantees the best performance but also causes unpredictable results. In some cases you may see fluctuating results due to race conditions.

 

Write Buffer and Batching 
When you use the batch() functionality the included Put instances will not be buffered using the client-side write buffer. The batch() calls are synchronous and send the operations directly to the servers; there is no delay or other intermediate processing. This is obviously different from the put() calls, so choose carefully which one you want to use.

 

Row Locks

Mutating operations, like put(), delete(), checkAndPut(), and so on, are executed exclusively, which means in a serial fashion for each row, to guarantee row-level atomicity.
The region servers provide a row lock feature ensuring that only a client holding the matching lock can modify a row. In practice, though, most client applications do not acquire an explicit lock but rely on the mechanism that guards each operation separately.

HBase's row-level atomicity is guaranteed by row locks. Although the system applies an implicit lock automatically for each operation, it also exposes explicit lock calls.
When should you use them? As the book itself advises:
You should avoid using row locks whenever possible.
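For completeness, a hedged sketch of what an explicit lock could look like with the old HTable API described in the book (lockRow()/unlockRow() and the Put(row, lock) constructor); these calls were deprecated and later removed, so treat this purely as an illustration of the mechanism, not as a recommendation.

// assumes: import org.apache.hadoop.hbase.client.*; import org.apache.hadoop.hbase.util.Bytes;
RowLock lock = table.lockRow(ROW1);          // blocks other writers to this row
try {
    Put put = new Put(ROW1, lock);           // the operation carries the explicit lock
    put.add(COLFAM1, QUAL1, Bytes.toBytes("val1"));
    table.put(put);
} finally {
    table.unlockRow(lock);                   // always release the lock, or the row stays blocked until the lease expires
}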

 

Scans

After the basic CRUD-type operations you will now be introduced to scans, a technique akin to cursors in traditional database systems, making use of the underlying sequential, sorted storage layout that HBase provides.

Introduction

Using the scan operations is very similar to using the get() methods.
And again, in symmetry to all the other functions, there is a supporting class named Scan. But since scans are similar to iterators, you do not have a scan() call but rather a getScanner(), which returns the actual scanner instance you need to iterate over. The available methods are:

ResultScanner getScanner(Scan scan) throws IOException
ResultScanner getScanner(byte[] family) throws IOException
ResultScanner getScanner(byte[] family, byte[] qualifier) throws IOException

The Scan class defines the scan criteria. getScanner() takes a Scan instance as a parameter (for the latter two overloads above, the system creates the Scan object for you) and returns a ResultScanner, which is an iterator: you fetch data through next().

The Scan constructors are as follows:
Scan()
Scan(byte[] startRow, Filter filter)
Scan(byte[] startRow)
Scan(byte[] startRow, byte[] stopRow)

The start row is always inclusive, while the end row is exclusive. This is often expressed as [startRow, stopRow) in interval notation.

Like Get, Scan supports the following additional restrictions (a short usage sketch follows the list):

Scan addFamily(byte [] family)
Scan addColumn(byte[] family, byte[] qualifier)
Scan setTimeRange(long minStamp, long maxStamp) throws IOException
Scan setTimeStamp(long timestamp)
Scan setMaxVersions()
Scan setMaxVersions(int maxVersions)
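
Putting the pieces together, a small sketch of configuring a Scan and handing it to getScanner(); the table handle, row keys, and column names are assumptions for illustration.

// assumes: import org.apache.hadoop.hbase.client.*; import org.apache.hadoop.hbase.util.Bytes;
Scan scan = new Scan(Bytes.toBytes("row-10"), Bytes.toBytes("row-20")); // [startRow, stopRow)
scan.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));       // restrict to one column
scan.setMaxVersions(3);                                                 // up to three versions per cell
ResultScanner scanner = table.getScanner(scan);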

 

The ResultScanner Class

Scans do not ship all the matching rows in one RPC to the client but instead do this on a row basis. This obviously makes sense as rows could be very large and sending thousands, and most likely more, of them in one call would use up too many resources, and take a long time.

The ResultScanner converts the scan into a get-like operation, wrapping the Result instance for each row into an iterator functionality. It has a few methods of its own:

Result next() throws IOException
Result[] next(int nbRows) throws IOException
void close() // release scanner

Make sure you release a scanner instance as timely as possible. An open scanner holds quite a few resources on the server side, which can accumulate and occupy a large amount of heap space.
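A common pattern (a sketch, assuming an open table instance) is to wrap the iteration in try/finally so the scanner is always closed; ResultScanner implements Iterable, so a for-each loop works as well as explicit next() calls.

ResultScanner scanner = table.getScanner(new Scan());
try {
    for (Result res : scanner) {       // each iteration yields one row
        System.out.println(res);
    }
} finally {
    scanner.close();                   // release the server-side scanner resources
}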

 

Caching vs. Batching

So far, each call to next() is a separate RPC for each row, even when you use the next(int nbRows) method, because it is nothing but a client-side loop over next() calls. Obviously this is not very good for performance when dealing with small cells, so it would make sense to fetch more than one row per RPC if possible. This is called scanner caching, and it is disabled by default.

If every next() requires an RPC, efficiency is low, especially when each row's data is small; hence scanner caching, which lets one RPC cover multiple rows. It is configurable and should be kept moderate, otherwise both call latency and client memory can become problems.
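For example, per-scan caching can be set on the Scan instance before the scanner is created; the value of 100 below is just an illustrative assumption to be tuned against your row size and client memory.

Scan scan = new Scan();
scan.setCaching(100);                           // ship up to 100 rows per RPC instead of one
ResultScanner scanner = table.getScanner(scan);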

 

So far you have learned to use the client-side scanner caching to make better use of bulk transfers between your client application and the remote region servers.
There is an issue, though, that was mentioned in passing earlier: very large rows. Those potentially do not fit into the memory of the client process. HBase and its client API have an answer for that: batching. As opposed to caching, which operates on the row level, batching works on the column level instead. It controls how many columns are retrieved for every call to any of the next() functions provided by the ResultScanner instance. For example, setting the scan to use setBatch(5) would return five columns per Result instance.

Batching is the opposite: it addresses very large rows, where a single row is read over several calls, in units of columns.
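A short sketch combining both knobs (the numbers are arbitrary assumptions): caching controls rows per RPC, batching controls columns per Result, so one very wide row may be split across several Result instances.

Scan scan = new Scan();
scan.setCaching(10);                            // rows (or partial rows) per RPC
scan.setBatch(5);                               // at most 5 columns per Result instance
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    System.out.println(result);                 // a wide row shows up as multiple Results
}
scanner.close();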



This article is excerpted from cnblogs (博客园); original publication date: 2012-09-26.
