This excerpt is taken from Chapter 3, Section 3.3 "Hands-on Practice" of the book Hadoop大数据分析与挖掘实战 (Hadoop Big Data Analysis and Mining in Practice) by Zhang Liangjun, Fan Zhe, Zhao Yunlong, and Li Chenghua; more chapters are available via the "华章社区" (Huazhang Community) public account on the Yunqi Community.
3.3 Hands-on Practice
Follow the detailed configuration steps in Section 3.1.2 and Chapter 2; once deployment is complete, you can carry out the experiments below (Hadoop 2.6 and Hive 1.2.1 are assumed by default).
Exercise 1: Hive Tables
1) Download the file "02-上机实验/visits_data.txt" and inspect the data.
[root@slave2 opt]# head -n 5 visits_data.txt
BUCKLEY      SUMMER      10/12/2010 14:48    10/12/2010 14:45    WH
CLOONEY      GEORGE      10/12/2010 14:47    10/12/2010 14:45    WH
PRENDERGAST  JOHN        10/12/2010 14:48    10/12/2010 14:45    WH
LANIER       JAZMIN      10/13/2010 13:00    WH      BILL SIGNING/
MAYNARD      ELIZABETH   10/13/2010 12:34    10/13/2010 13:00    WH      BILL SIGNING/
The visits_data.txt data contains six columns separated by "\t": last name, first name, arrival time, scheduled time, location, and comment.
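The tab-delimited layout can be checked locally with awk before loading anything into Hive (a quick sketch; the two sample rows and the /tmp path are made up for illustration):

```shell
# Rebuild two tab-separated rows like those in visits_data.txt and
# count their fields with awk; each row should split into 6 columns.
printf 'BUCKLEY\tSUMMER\t10/12/2010 14:48\t10/12/2010 14:45\tWH\t\n' > /tmp/visits_sample.txt
printf 'CLOONEY\tGEORGE\t10/12/2010 14:47\t10/12/2010 14:45\tWH\t\n' >> /tmp/visits_sample.txt
# Print the field count and the first column (last name) of each row.
awk -F'\t' '{ print NF, $1 }' /tmp/visits_sample.txt
```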
2) Download "02-上机实验/visits.hive" and inspect it.
[root@slave2 opt]# cat visits.hive
--cat visits.hive
create table people_visits (
last_name string,
first_name string,
arrival_time string,
scheduled_time string,
meeting_location string,
info_comment string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
The statement above creates the people_visits table in Hive.
3) Create the people_visits table by running the script with the Hive command.
[root@slave2 bin]# ./hive -f /opt/visits.hive
Logging initialized using configuration in jar:file:/opt/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/hive-log4j.properties
OK
Time taken: 2.391 seconds
4) Use the Hive shell command line to inspect the newly created table.
[root@slave2 ~]# hive
Logging initialized using configuration in jar:file:/opt/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> show tables;
OK
people_visits
Time taken: 1.344 seconds, Fetched: 1 row(s)
hive> describe people_visits ;
OK
last_name               string
first_name              string
arrival_time            string
scheduled_time          string
meeting_location        string
info_comment            string
Time taken: 0.338 seconds, Fetched: 6 row(s)
Here you can see the table that was just created, along with its schema.
5) Insert data.
① Query the table to view its contents.
hive> select * from people_visits limit 10;
OK
Time taken: 0.863 seconds
As you can see, the table is empty.
② Use the hadoop fs command to copy visits_data.txt into the table's HDFS directory, /user/hive/warehouse/people_visits.
[root@slave2 opt]# hadoop fs -put visits_data.txt /user/hive/warehouse/people_visits
[root@slave2 opt]# hadoop fs -ls /user/hive/warehouse/people_visits
-rw-r--r--   3 root supergroup     989239 2015-08-17 10:30 /user/hive/warehouse/people_visits/visits_data.txt
③ Query the data again.
hive> select * from people_visits limit 5;
OK
BUCKLEY      SUMMER      10/12/2010 14:48    10/12/2010 14:45    WH
CLOONEY      GEORGE      10/12/2010 14:47    10/12/2010 14:45    WH
PRENDERGAST  JOHN        10/12/2010 14:48    10/12/2010 14:45    WH
LANIER       JAZMIN      10/13/2010 13:00    WH      BILL SIGNING/
MAYNARD      ELIZABETH   10/13/2010 12:34    10/13/2010 13:00    WH      BILL SIGNING/
Time taken: 0.155 seconds, Fetched: 5 row(s)
The data is now visible in the table.
6) Run a query through MapReduce.
hive> select count(*) from people_visits;
Query ID = root_20150817103724_d20ca51d-06ca-4efb-be59-6f66aec97489
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1439775378077_0003, Tracking URL = http://node101:8088/proxy/application_1439775378077_0003/
Kill Command = /opt/hadoop-2.6.0/bin/hadoop job -kill job_1439775378077_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-08-17 10:37:33,759 Stage-1 map = 0%, reduce = 0%
2015-08-17 10:37:41,432 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.11 sec
2015-08-17 10:37:48,932 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.57 sec
MapReduce Total cumulative CPU time: 4 seconds 570 msec
Ended Job = job_1439775378077_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.57 sec HDFS Read: 996387 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 570 msec
OK
17977
Time taken: 25.92 seconds, Fetched: 1 row(s)
The MapReduce query above returns the total number of rows in the table.
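Because the table's ROW FORMAT maps one file line to one row, the total can be cross-checked outside Hive with a plain line count on the underlying file (a local sketch; the sample file and /tmp path are hypothetical):

```shell
# Each line of the delimited file is one table row, so a simple line
# count mirrors what select count(*) computes over the same data.
printf 'BUCKLEY\tSUMMER\n' >  /tmp/visits_count.txt
printf 'CLOONEY\tGEORGE\n' >> /tmp/visits_count.txt
printf 'LANIER\tJAZMIN\n'  >> /tmp/visits_count.txt
wc -l < /tmp/visits_count.txt   # prints the row count (3 for this sample)
```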
7) Drop the people_visits table.
hive> drop table people_visits;
OK
Time taken: 1.355 seconds
hive> dfs -ls /user/hive/warehouse/people_visits;
ls: '/user/hive/warehouse/people_visits': No such file or directory
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null
As shown here, after dropping the table, its data in HDFS was deleted as well.
Exercise 2: Hive External Tables
1) Copy "02-上机实验/names.txt" to the /opt directory on the client machine, then upload it to HDFS.
[root@slave2 ~]# hadoop fs -put /opt/names.txt /user/root/names.txt
[root@slave2 ~]# hadoop fs -ls /user/root/names.txt
-rw-r--r-- 3 root supergroup 78 2015-08-17 11:11 /user/root/names.txt
[root@slave2 ~]#
2) Create a /user/root/hivedemo directory on HDFS.
[root@slave2 ~]# hadoop fs -mkdir /user/root/hivedemo
3) Create a Hive external table, specifying /user/root/hivedemo as its data location.
hive> create external table names(id int,name string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LOCATION '/user/root/hivedemo';
OK
Time taken: 0.206 seconds
4) Load the data into the external table names.
hive> load data inpath '/user/root/names.txt' into table names;
Loading data to table default.names
Table default.names stats: [numFiles=0, numRows=0, totalSize=0, rawDataSize=0]
OK
Time taken: 0.451 seconds
5) Query the data in the table.
hive> select * from names;
OK
0 Rich
1 Barry
2 George
3 Ulf
4 Danielle
5 Tom
6 manish
7 Brian
8 Mark
Time taken: 0.102 seconds, Fetched: 9 row(s)
hive> dfs -ls hivedemo;
Found 1 items
-rwxr-xr-x 3 root supergroup 78 2015-08-17 11:11 hivedemo/names.txt
hive> dfs -ls /user/hive/warehouse;
The table contains data, and that data is stored in the specified /user/root/hivedemo directory rather than in the default /user/hive/warehouse location.
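One detail worth noting: load data inpath (without the LOCAL keyword) moves the source file on HDFS into the table's LOCATION rather than copying it, so /user/root/names.txt itself ends up under /user/root/hivedemo. The move semantics can be sketched with ordinary local directories standing in for the HDFS paths (all names here are hypothetical):

```shell
# Sketch of load data inpath semantics: the source file is moved, not
# copied, into the table's LOCATION. Local dirs stand in for HDFS.
mkdir -p /tmp/hivedemo_sketch/table_location
printf '0\tRich\n' > /tmp/hivedemo_sketch/names.txt
mv /tmp/hivedemo_sketch/names.txt /tmp/hivedemo_sketch/table_location/
ls /tmp/hivedemo_sketch/table_location   # the file now lives under the table's location
```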
6) Drop the table.
hive> drop table names;
OK
Time taken: 0.136 seconds
hive> show tables;
OK
Time taken: 0.049 seconds
hive> dfs -ls hivedemo;
Found 1 items
-rwxr-xr-x 3 root supergroup 78 2015-08-17 11:11 hivedemo/names.txt
Here you can see that although the table has been dropped, the data in HDFS has not been deleted: for an external table, drop table removes only the table's metadata.
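The contrast between Exercise 1 (managed table: dropping it deleted the data) and Exercise 2 (external table: the data survived) can be sketched with local directories standing in for HDFS (all paths hypothetical):

```shell
# Sketch of drop-table semantics: a managed table's directory is
# removed along with the table; an external table's directory is left
# untouched because Hive only drops the metadata.
mkdir -p /tmp/drop_sketch/managed_table /tmp/drop_sketch/external_location
printf '0\tRich\n' > /tmp/drop_sketch/managed_table/data.txt
printf '0\tRich\n' > /tmp/drop_sketch/external_location/data.txt
rm -r /tmp/drop_sketch/managed_table       # managed: data goes with the table
ls /tmp/drop_sketch/external_location      # external: data survives the drop
```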