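HAWQ is an HDFS-based MPP (Massively Parallel Processing) SQL engine. It supports standard SQL and transactions, and is reported to run queries up to several hundred times faster than native Hive.
This article describes how to set up HAWQ on an E-MapReduce cluster.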
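Part 0: Deployment modes
HAWQ supports several deployment modes:
- Non-HA
  - standalone
  - yarn
- HA
  - standalone
  - yarn
This article uses the HA + yarn mode as its example. The other modes are simpler to configure; see the HAWQ documentation for details.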
Part 1: Create the cluster
Create a cluster from the E-MapReduce product page. This example uses an HA cluster with five machines:
master: emr-header-1
standby: emr-header-2
slaves:
emr-worker-1
emr-worker-2
emr-worker-3
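Part 2: Deploy HAWQ
1. Add the gpadmin account
Run the following on every machine in the cluster:
> sudo su hadoop
> sudo useradd -G hadoop gpadmin
> sudo passwd gpadmin # set a password
> sudo vi /etc/sudoers
Append the line gpadmin ALL=(ALL) NOPASSWD: ALL at the end of the file and save.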
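As an alternative to editing /etc/sudoers by hand, the same rule can be placed in a drop-in file and syntax-checked before it takes effect. The sketch below is only an illustration: the helper name `sudoers_entry` is hypothetical, and it assumes the system's sudoers file includes the /etc/sudoers.d directory.

```shell
# sudoers_entry USER: prints a passwordless-sudo rule for USER.
# (Hypothetical helper, not part of HAWQ or E-MapReduce.)
sudoers_entry() {
  printf '%s ALL=(ALL) NOPASSWD: ALL\n' "$1"
}

# Possible usage on each node (assumes /etc/sudoers.d is included by sudoers):
#   sudoers_entry gpadmin | sudo tee /etc/sudoers.d/gpadmin
#   sudo chmod 440 /etc/sudoers.d/gpadmin
#   sudo visudo -cf /etc/sudoers.d/gpadmin   # syntax check only
```

Checking the file with visudo guards against a syntax error that would otherwise lock sudo for every user.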
2. Install HAWQ
Run the following steps on the master node.
- Install HAWQ
> sudo su root
> wget http://emr-agent-pack.oss-cn-hangzhou.aliyuncs.com/hawq/hawq-2.0.0.0-22126.x86_64.rpm
> rpm -ivh hawq-2.0.0.0-22126.x86_64.rpm
- Set up passwordless SSH
> sudo su gpadmin
> vi hosts ## add the IPs of all nodes in the cluster
> vi segment ## add the IPs of all slave nodes
> vi masters ## add the IPs of the master and standby nodes
> source /usr/local/hawq/greenplum_path.sh
> hawq ssh-exkeys -f hosts
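The three host lists can also be generated rather than typed into vi. The following is a minimal sketch, not the article's own method: the function name `write_host_files` is hypothetical, and it writes the example cluster's host names, which assumes those names resolve on every node. Substitute IP addresses if they do not.

```shell
# Node names from the example cluster; replace with your own names or IPs.
MASTERS="emr-header-1 emr-header-2"
SEGMENTS="emr-worker-1 emr-worker-2 emr-worker-3"

# write_host_files DIR: writes DIR/masters, DIR/segment and DIR/hosts,
# one entry per line. hosts is the concatenation of the other two files.
# (Hypothetical helper.)
write_host_files() {
  dir="$1"
  # Unquoted on purpose: each space-separated name becomes one printf argument.
  printf '%s\n' $MASTERS > "$dir/masters"
  printf '%s\n' $SEGMENTS > "$dir/segment"
  cat "$dir/masters" "$dir/segment" > "$dir/hosts"
}
```

Running `write_host_files .` in gpadmin's home directory produces the files that `hawq ssh-exkeys -f hosts` and the later `hawq ssh -f ...` commands read.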
- Adjust kernel parameters
> hawq ssh -f hosts -e 'sudo sysctl -w kernel.sem="50100 128256000 50100 2560"'
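Note that `sysctl -w` changes the value only for the running kernel; it is lost on reboot unless the setting is also written to /etc/sysctl.conf. To confirm the value took effect, the current setting can be compared with the intended one. The helpers below are a hypothetical sketch; they normalize whitespace because /proc/sys/kernel/sem separates the four fields with tabs.

```shell
# normalize STRING: collapses tabs and repeated spaces into single spaces.
normalize() {
  # Word splitting is intentional here; the input holds only numbers.
  set -- $1
  printf '%s' "$*"
}

# sem_matches CURRENT WANTED: succeeds when both hold the same numbers.
# (Hypothetical helper.)
sem_matches() {
  [ "$(normalize "$1")" = "$(normalize "$2")" ]
}

# Possible usage on a node:
#   sem_matches "$(cat /proc/sys/kernel/sem)" "50100 128256000 50100 2560" \
#     && echo "kernel.sem is set"
```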
- Install HAWQ on the other nodes
> hawq scp -f hosts hawq-2.0.0.0-22126.x86_64.rpm =:~/
> hawq ssh -f hosts -e "sudo rpm -ivh ~/hawq-*.rpm"
- Create the HAWQ directories
> hawq ssh -f masters -e 'sudo mkdir /mnt/disk{2..4}'
> hawq ssh -f masters -e 'sudo chown hdfs:hadoop /mnt/disk{2..4}'
> hawq ssh -f masters -e 'sudo chmod 770 /mnt/disk{2..4}'
> hawq ssh -f masters -e 'mkdir -p /mnt/disk1/hawq/data/master'
> hawq ssh -f segment -e 'mkdir -p /mnt/disk1/hawq/data/segment'
> hawq ssh -f hosts -e 'mkdir -p /mnt/disk{1..4}/hawq/tmp'
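The temp directories created in the last command are later listed, comma-separated, as the values of hawq_master_temp_directory and hawq_segment_temp_directory. On a cluster with a different number of data disks, that value can be built with a small loop. This is a sketch with a hypothetical function name, and it assumes the /mnt/diskN layout used throughout this article.

```shell
# tmp_dir_list N: prints /mnt/disk1/hawq/tmp,...,/mnt/diskN/hawq/tmp
# as a single comma-separated line. (Hypothetical helper.)
tmp_dir_list() {
  n="$1"
  i=1
  out=""
  while [ "$i" -le "$n" ]; do
    # ${out:+,} adds a comma only when out is already non-empty.
    out="${out}${out:+,}/mnt/disk${i}/hawq/tmp"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}
```

For the four-disk nodes in this example, `tmp_dir_list 4` yields the same string that appears in the hawq-site.xml settings.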
- Switch YARN to the capacity scheduler
> vi /etc/emr/hadoop-conf/yarn-site.xml
Add this property:
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
Copy the master's yarn-site.xml to all other nodes, then restart YARN on the cluster.
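Before restarting YARN it may be worth confirming that the property is present in the file on each node. The check below is a rough, line-based sketch rather than a real XML parser: the function name is hypothetical, and it assumes the value element appears within two lines of the name element.

```shell
# uses_capacity_scheduler FILE: succeeds if FILE sets
# yarn.resourcemanager.scheduler.class to the CapacityScheduler class.
# (Hypothetical helper; a line-based heuristic, not an XML parser.)
uses_capacity_scheduler() {
  grep -A 2 -F '<name>yarn.resourcemanager.scheduler.class</name>' "$1" \
    | grep -q -F 'scheduler.capacity.CapacityScheduler</value>'
}

# Possible usage:
#   uses_capacity_scheduler /etc/emr/hadoop-conf/yarn-site.xml \
#     && echo "capacity scheduler configured"
```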
- Edit the HAWQ configuration
> vi /usr/local/hawq/etc/hawq-site.xml
Set the following properties:
Property | Value | Description |
---|---|---|
hawq_master_address_host | emr-header-1 | |
hawq_standby_address_host | emr-header-2 | |
hawq_dfs_url | emr-cluster/hawq_default | |
hawq_master_directory | /mnt/disk1/hawq/data/master | |
hawq_segment_directory | /mnt/disk1/hawq/data/segment | |
hawq_master_temp_directory | /mnt/disk1/hawq/tmp,/mnt/disk2/hawq/tmp,/mnt/disk3/hawq/tmp,/mnt/disk4/hawq/tmp | |
hawq_segment_temp_directory | /mnt/disk1/hawq/tmp,/mnt/disk2/hawq/tmp,/mnt/disk3/hawq/tmp,/mnt/disk4/hawq/tmp | |
hawq_global_rm_type | yarn | |
hawq_rm_yarn_address | emr-header-1:8032,emr-header-2:8032 | |
hawq_rm_yarn_scheduler_address | emr-header-1:8030,emr-header-2:8030 | |
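Each row of this table corresponds to one property element in hawq-site.xml, which uses the same name/value layout as the Hadoop configuration files. As an illustration, the hypothetical helper below prints that element for a given pair. It does no XML escaping, which is fine for the values listed here but is an assumption for anything else.

```shell
# hawq_property NAME VALUE: prints one <property> element in the
# name/value layout used by hawq-site.xml. No XML escaping is done.
# (Hypothetical helper.)
hawq_property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Possible usage:
#   hawq_property hawq_global_rm_type yarn
#   hawq_property hawq_rm_yarn_address emr-header-1:8032,emr-header-2:8032
```

The printed elements belong inside the file's existing configuration element, replacing any earlier definition of the same property name.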
> vi /usr/local/hawq/etc/hdfs-client.xml
# Uncomment the HA section and set the following properties
Property | Value | Description |
---|---|---|
dfs.nameservices | emr-cluster | |
dfs.ha.namenodes.emr-cluster | nn1,nn2 | |
dfs.namenode.rpc-address.emr-cluster.nn1 | emr-header-1:8020 | |
dfs.namenode.rpc-address.emr-cluster.nn2 | emr-header-2:8020 | |
dfs.namenode.http-address.emr-cluster.nn1 | emr-header-1:50070 | |
dfs.namenode.http-address.emr-cluster.nn2 | emr-header-2:50070 | |
> vi /usr/local/hawq/etc/yarn-client.xml
# Uncomment the HA section and set the following properties
Property | Value | Description |
---|---|---|
yarn.resourcemanager.ha | emr-header-1:8032,emr-header-2:8032 | |
yarn.resourcemanager.scheduler.ha | emr-header-1:8030,emr-header-2:8030 | |
> vi /usr/local/hawq/etc/slaves # add the IPs of the segment nodes
Once all of the HAWQ configuration changes above have been made on the master node, sync the files to every other node:
> hawq scp -f hosts /usr/local/hawq/etc/yarn-client.xml /usr/local/hawq/etc/hdfs-client.xml /usr/local/hawq/etc/hawq-site.xml /usr/local/hawq/etc/slaves =:/usr/local/hawq/etc/
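A quick way to confirm that the copy reached every node is to compare checksums of a synced file across the cluster. The sketch below uses a hypothetical helper that reads any text on stdin and looks only at the MD5 digests in it, so it does not depend on the exact per-host prefix that `hawq ssh` prints (that output format is not assumed here).

```shell
# all_checksums_equal: reads text on stdin, extracts every 32-hex-digit
# MD5 digest, and succeeds only if at least one digest was found and
# all of them are identical. (Hypothetical helper.)
all_checksums_equal() {
  distinct=$(grep -oE '[0-9a-f]{32}' | sort -u | wc -l | tr -d ' ')
  [ "$distinct" = "1" ]
}

# Possible usage on the master node:
#   hawq ssh -f hosts -e 'md5sum /usr/local/hawq/etc/hawq-site.xml' \
#     | all_checksums_equal && echo "hawq-site.xml is in sync"
```

The same check can be repeated for hdfs-client.xml, yarn-client.xml and slaves.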
3. Initialize and start the HAWQ cluster
> hawq init cluster
4. Verify
> psql -d postgres
postgres=# create database mytest;
CREATE DATABASE
postgres=# \c mytest
You are now connected to database "mytest" as user "gpadmin".
mytest=# create table t (i int);
CREATE TABLE
mytest=# insert into t select generate_series(1,100);
INSERT 0 100
mytest=# \timing
Timing is on.
mytest=# select count(*) from t;
count
-------
100
(1 row)
Time: 77.333 ms
mytest=# select * from t;