A Detailed Guide to Using Hadoop 3.2.0



Published 2019-04-23 07:34:55

1. Overview
Hadoop 3 has been out for quite a while now, and the new features integrated across its releases are genuinely useful. As of this writing, the latest release is Hadoop 3.2.0. In this post I will share some of the problems I ran into while using Hadoop 3, along with their solutions.

2. Details
2.1 Required software
Before using these components, some preparation is needed:

Hadoop-3.2.0 binary package (downloading the Hadoop-3.2.0 source code as well is recommended; it is needed in a later step)
Maven-3.6.1 (to compile the Hadoop-3.2.0 source code)
ProtoBuf-2.5.0 (to compile the Hadoop-3.2.0 source code)
2.2 Deployment environment
SSH setup, user creation, passwordless login, and similar steps are not covered here; you can refer to the earlier post "Configuring a Highly Available Hadoop Platform". Under the deployment user, configure the Hadoop environment variables, such as HADOOP_HOME and HADOOP_CONF_DIR.

2.2.1 Configure environment variables
The details are as follows:

vi ~/.bash_profile

Edit the following variables:

export MAVEN_OPTS="-Xms256m -Xmx512m"
export JAVA_HOME=/data/soft/new/jdk
export HADOOP_HOME=/data/soft/new/hadoop
export HADOOP_CONF_DIR=/data/soft/new/hadoop-config
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME

# MAVEN_HOME is assumed to point at the Maven 3.6.1 install; note that
# MAVEN_OPTS holds JVM options and does not belong in PATH.
export MAVEN_HOME=/data/soft/new/maven
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$MAVEN_HOME/bin:$HBASE_HOME/bin
2.2.2 Compiling the Hadoop-3.2.0 source code
Why does the Hadoop-3.2.0 source code need to be compiled? Because when submitting jobs to YARN on Hadoop-3.2.0, the following exception may occur:

2019-04-21 22:47:45,307 ERROR [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.NullPointerException

    at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:178)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:122)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:280)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStart(MRAppMaster.java:979)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1293)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$6.run(MRAppMaster.java:1761)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1757)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1691)

Caused by: java.lang.NullPointerException

    at org.apache.hadoop.mapreduce.v2.app.client.MRClientService.getHttpPort(MRClientService.java:177)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:159)
    ... 14 more

Reading the source code reveals that this is caused by a piece of code in the class org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer. The relevant code is as follows:

if (rmIds != null) {

  List<String> urls = new ArrayList<>();
  for (String rmId : rmIds) {
    String url = getUrlByRmId(yarnConf, rmId);
    // urls.add(url); // the original unconditional add is commented out and replaced by the null check below
    if (url != null) {
      urls.add(url);
    }        
  }
  if (!urls.isEmpty()) {
    params.put(RM_HA_URLS, StringUtils.join(",", urls));
  }
}

This is related to compatibility with the HA settings in yarn-site.xml: it depends on whether yarn.resourcemanager.webapp.address and yarn.resourcemanager.webapp.https.address are set, since getUrlByRmId can return null when they are not.
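The effect of the patch above can be illustrated with a small sketch (plain Python standing in for the Java, with the per-RM lookup results hard-coded as an assumption): URLs that resolve to null are skipped instead of being joined into the RM_HA_URLS parameter, which is what previously triggered the NullPointerException.

```python
def build_rm_ha_urls(urls_by_rm_id, rm_ids):
    """Mimic the patched AmFilterInitializer logic: skip missing URLs."""
    urls = []
    for rm_id in rm_ids:
        url = urls_by_rm_id.get(rm_id)  # stands in for getUrlByRmId()
        if url is not None:             # the added null check
            urls.append(url)
    # Mirrors StringUtils.join(",", urls); None when no URL is available.
    return ",".join(urls) if urls else None

# rm2 has no webapp address configured, so it is simply skipped.
print(build_rm_ha_urls({"rm1": "rm1:8088", "rm2": None}, ["rm1", "rm2"]))  # rm1:8088
```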

Prepare the Maven environment (the latest version is recommended, because Hadoop-3.2.0 uses fairly recent Maven plugins). Hadoop still uses ProtoBuf 2.5.0, so keep that version and simply make it available in the build environment. Then compile the Hadoop-3.2.0 source code with the following command:

To speed up the build, skip the unit tests and documentation:

mvn package -Pdist -DskipTests -Dtar -Dmaven.javadoc.skip=true

Run the command and wait for it to finish. A successful build ends with output like the following:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Hadoop Main 3.2.0:
[INFO]
[INFO] Apache Hadoop Main ................................. SUCCESS [ 1.040 s]
[INFO] Apache Hadoop Build Tools .......................... SUCCESS [ 1.054 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [ 0.845 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [ 0.546 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [ 0.185 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [ 1.460 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [ 2.556 s]
[INFO] Apache Hadoop MiniKDC .............................. SUCCESS [ 0.529 s]
[INFO] Apache Hadoop Auth ................................. SUCCESS [ 2.412 s]
[INFO] Apache Hadoop Auth Examples ........................ SUCCESS [ 0.977 s]
[INFO] Apache Hadoop Common ............................... SUCCESS [ 28.555 s]
[INFO] Apache Hadoop NFS .................................. SUCCESS [ 1.319 s]
[INFO] Apache Hadoop KMS .................................. SUCCESS [ 11.622 s]
[INFO] Apache Hadoop Common Project ....................... SUCCESS [ 0.049 s]
[INFO] Apache Hadoop HDFS Client .......................... SUCCESS [05:37 min]
[INFO] Apache Hadoop HDFS ................................. SUCCESS [ 28.582 s]
[INFO] Apache Hadoop HDFS Native Client ................... SUCCESS [ 0.966 s]
[INFO] Apache Hadoop HttpFS ............................... SUCCESS [ 6.328 s]
[INFO] Apache Hadoop HDFS-NFS ............................. SUCCESS [ 0.859 s]
[INFO] Apache Hadoop HDFS-RBF ............................. SUCCESS [ 3.071 s]
[INFO] Apache Hadoop HDFS Project ......................... SUCCESS [ 0.035 s]
[INFO] Apache Hadoop YARN ................................. SUCCESS [ 0.039 s]
[INFO] Apache Hadoop YARN API ............................. SUCCESS [ 5.060 s]
[INFO] Apache Hadoop YARN Common .......................... SUCCESS [02:24 min]
[INFO] Apache Hadoop YARN Registry ........................ SUCCESS [ 1.147 s]
[INFO] Apache Hadoop YARN Server .......................... SUCCESS [ 0.041 s]
[INFO] Apache Hadoop YARN Server Common ................... SUCCESS [01:44 min]
[INFO] Apache Hadoop YARN NodeManager ..................... SUCCESS [ 4.143 s]
[INFO] Apache Hadoop YARN Web Proxy ....................... SUCCESS [ 0.921 s]
[INFO] Apache Hadoop YARN ApplicationHistoryService ....... SUCCESS [ 12.087 s]
[INFO] Apache Hadoop YARN Timeline Service ................ SUCCESS [ 4.518 s]
[INFO] Apache Hadoop YARN ResourceManager ................. SUCCESS [ 7.887 s]
[INFO] Apache Hadoop YARN Server Tests .................... SUCCESS [ 0.982 s]
[INFO] Apache Hadoop YARN Client .......................... SUCCESS [ 1.712 s]
[INFO] Apache Hadoop YARN SharedCacheManager .............. SUCCESS [ 0.919 s]
[INFO] Apache Hadoop YARN Timeline Plugin Storage ......... SUCCESS [ 1.269 s]
[INFO] Apache Hadoop YARN TimelineService HBase Backend ... SUCCESS [ 0.062 s]
[INFO] Apache Hadoop YARN TimelineService HBase Common .... SUCCESS [ 26.109 s]
[INFO] Apache Hadoop YARN TimelineService HBase Client .... SUCCESS [ 33.811 s]
[INFO] Apache Hadoop YARN TimelineService HBase Servers ... SUCCESS [ 0.041 s]
[INFO] Apache Hadoop YARN TimelineService HBase Server 1.2 SUCCESS [ 1.659 s]
[INFO] Apache Hadoop YARN TimelineService HBase tests ..... SUCCESS [ 44.305 s]
[INFO] Apache Hadoop YARN Router .......................... SUCCESS [ 1.186 s]
[INFO] Apache Hadoop YARN Applications .................... SUCCESS [ 0.049 s]
[INFO] Apache Hadoop YARN DistributedShell ................ SUCCESS [ 0.843 s]
[INFO] Apache Hadoop YARN Unmanaged Am Launcher ........... SUCCESS [ 0.571 s]
[INFO] Apache Hadoop MapReduce Client ..................... SUCCESS [ 0.136 s]
[INFO] Apache Hadoop MapReduce Core ....................... SUCCESS [ 3.399 s]
[INFO] Apache Hadoop MapReduce Common ..................... SUCCESS [ 1.819 s]
[INFO] Apache Hadoop MapReduce Shuffle .................... SUCCESS [ 1.289 s]
[INFO] Apache Hadoop MapReduce App ........................ SUCCESS [ 2.320 s]
[INFO] Apache Hadoop MapReduce HistoryServer .............. SUCCESS [ 1.450 s]
[INFO] Apache Hadoop MapReduce JobClient .................. SUCCESS [ 2.856 s]
[INFO] Apache Hadoop Mini-Cluster ......................... SUCCESS [ 0.969 s]
[INFO] Apache Hadoop YARN Services ........................ SUCCESS [ 0.041 s]
[INFO] Apache Hadoop YARN Services Core ................... SUCCESS [ 13.856 s]
[INFO] Apache Hadoop YARN Services API .................... SUCCESS [ 1.034 s]
[INFO] Apache Hadoop Image Generation Tool ................ SUCCESS [ 0.715 s]
[INFO] Yet Another Learning Platform ...................... SUCCESS [ 0.946 s]
[INFO] Apache Hadoop YARN Site ............................ SUCCESS [ 0.065 s]
[INFO] Apache Hadoop YARN UI .............................. SUCCESS [ 0.048 s]
[INFO] Apache Hadoop YARN Project ......................... SUCCESS [ 8.150 s]
[INFO] Apache Hadoop MapReduce HistoryServer Plugins ...... SUCCESS [ 0.525 s]
[INFO] Apache Hadoop MapReduce NativeTask ................. SUCCESS [ 0.931 s]
[INFO] Apache Hadoop MapReduce Uploader ................... SUCCESS [ 0.575 s]
[INFO] Apache Hadoop MapReduce Examples ................... SUCCESS [ 0.829 s]
[INFO] Apache Hadoop MapReduce ............................ SUCCESS [ 3.370 s]
[INFO] Apache Hadoop MapReduce Streaming .................. SUCCESS [ 6.949 s]
[INFO] Apache Hadoop Distributed Copy ..................... SUCCESS [ 1.523 s]
[INFO] Apache Hadoop Archives ............................. SUCCESS [ 0.392 s]
[INFO] Apache Hadoop Archive Logs ......................... SUCCESS [ 0.515 s]
[INFO] Apache Hadoop Rumen ................................ SUCCESS [ 0.807 s]
[INFO] Apache Hadoop Gridmix .............................. SUCCESS [ 0.774 s]
[INFO] Apache Hadoop Data Join ............................ SUCCESS [ 0.385 s]
[INFO] Apache Hadoop Extras ............................... SUCCESS [ 0.425 s]
[INFO] Apache Hadoop Pipes ................................ SUCCESS [ 0.055 s]
[INFO] Apache Hadoop OpenStack support .................... SUCCESS [ 0.688 s]
[INFO] Apache Hadoop Amazon Web Services support .......... SUCCESS [ 54.379 s]
[INFO] Apache Hadoop Kafka Library support ................ SUCCESS [ 3.304 s]
[INFO] Apache Hadoop Azure support ........................ SUCCESS [01:42 min]
[INFO] Apache Hadoop Aliyun OSS support ................... SUCCESS [ 3.943 s]
[INFO] Apache Hadoop Client Aggregator .................... SUCCESS [ 2.479 s]
[INFO] Apache Hadoop Scheduler Load Simulator ............. SUCCESS [ 1.577 s]
[INFO] Apache Hadoop Resource Estimator Service ........... SUCCESS [ 5.400 s]
[INFO] Apache Hadoop Azure Data Lake support .............. SUCCESS [02:40 min]
[INFO] Apache Hadoop Tools Dist ........................... SUCCESS [ 7.984 s]
[INFO] Apache Hadoop Tools ................................ SUCCESS [ 0.056 s]
[INFO] Apache Hadoop Client API ........................... SUCCESS [01:18 min]
[INFO] Apache Hadoop Client Runtime ....................... SUCCESS [ 51.046 s]
[INFO] Apache Hadoop Client Packaging Invariants .......... SUCCESS [ 1.265 s]
[INFO] Apache Hadoop Client Test Minicluster .............. SUCCESS [01:41 min]
[INFO] Apache Hadoop Client Packaging Invariants for Test . SUCCESS [ 0.172 s]
[INFO] Apache Hadoop Client Packaging Integration Tests ... SUCCESS [ 0.146 s]
[INFO] Apache Hadoop Distribution ......................... SUCCESS [ 23.411 s]
[INFO] Apache Hadoop Client Modules ....................... SUCCESS [ 0.045 s]
[INFO] Apache Hadoop Cloud Storage ........................ SUCCESS [ 0.649 s]
[INFO] Apache Hadoop Cloud Storage Project ................ SUCCESS [ 0.060 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 24:49 min
[INFO] Finished at: 2019-04-22T03:21:56+08:00
[INFO] ------------------------------------------------------------------------
Finally, take the compiled jar hadoop-dist/target/hadoop-3.2.0/share/hadoop/yarn/hadoop-yarn-server-web-proxy-3.2.0.jar, upload it to the Hadoop cluster, and use it to replace the jar of the same name under $HADOOP_HOME/share/hadoop/yarn.

2.2.3 Hadoop configuration files
I have covered the Hadoop 2 configuration files before; here is a freshly organized set of configuration files for Hadoop 3:

1.hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>

<property>
    <name>dfs.nameservices</name>
    <value>cluster1</value>
</property>
<property>
    <name>dfs.ha.namenodes.cluster1</name>
    <value>nna,nns</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.cluster1.nna</name>
    <value>nna:9820</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.cluster1.nns</name>
    <value>nns:9820</value>
</property>
<property>
    <name>dfs.namenode.http-address.cluster1.nna</name>
    <value>nna:9870</value>
</property>
<property>
    <name>dfs.namenode.http-address.cluster1.nns</name>
    <value>nns:9870</value>
</property>
<property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://dn1:8485;dn2:8485;dn3:8485/cluster1</value>
</property>
<property>
    <name>dfs.client.failover.proxy.provider.cluster1</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
</property>
<property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hadoop/.ssh/id_rsa</value>
</property>
<!-- If possible, mount this directory on a dedicated disk -->
<property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/data/soft/new/dfs/journal</value>
</property>
<property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
</property>
<!-- If possible, mount this directory on a dedicated disk -->
<property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/soft/new/dfs/name</value>
</property>
<!-- A physical machine usually has several independent disks; list them separated by commas -->
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/soft/new/dfs/data</value>
</property>
<!-- Set the replication factor as appropriate; 3 is fine if HDFS has sufficient space -->
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.journalnode.http-address</name>
    <value>0.0.0.0:8480</value>
</property>
<property>
    <name>dfs.journalnode.rpc-address</name>
    <value>0.0.0.0:8485</value>
</property>
<property>
    <name>ha.zookeeper.quorum</name>
    <value>dn1:2181,dn2:2181,dn3:2181</value>
</property>



2.core-site.xml
<?xml version="1.0" encoding="UTF-8"?>

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://cluster1</value>
</property>
<property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/data/soft/new/dfs/tmp</value>
</property>
<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>
<property>
    <name>ha.zookeeper.quorum</name>
    <value>dn1:2181,dn2:2181,dn3:2181</value>
</property>


3.mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>0.0.0.0:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>nna:19888</value>
</property>
<property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>512</value>
</property>
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>512</value>
</property>
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx512M</value>
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>512</value>
</property>
<property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx512M</value>
</property>
<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512M</value>
</property>
<!-- Load dependency JARs and configuration files -->
<property>
    <name>mapreduce.application.classpath</name>
    <value>/data/soft/new/hadoop-config,/data/soft/new/hadoop/share/hadoop/common/*,/data/soft/new/hadoop/share/hadoop/common/lib/*,/data/soft/new/hadoop/share/hadoop/hdfs/*,/data/soft/new/hadoop/share/hadoop/hdfs/lib/*,/data/soft/new/hadoop/share/hadoop/yarn/*,/data/soft/new/hadoop/share/hadoop/yarn/lib/*,/data/soft/new/hadoop/share/hadoop/mapreduce/*,/data/soft/new/hadoop/share/hadoop/mapreduce/lib/*</value>
</property>


4.yarn-site.xml
(content collapsed in the original post)
5.fair-scheduler.xml
<?xml version="1.0"?>

<allocations>
    <queue name="root">
        <aclSubmitApps>hadoop</aclSubmitApps>
        <aclAdministerApps>hadoop</aclAdministerApps>
        <!-- CPU and memory settings for the default queue -->
        <queue name="default">
            <maxRunningApps>10</maxRunningApps>
            <minResources>1024mb,1vcores</minResources>
            <maxResources>6144mb,6vcores</maxResources>
            <schedulingPolicy>fair</schedulingPolicy>
            <weight>1.0</weight>
            <aclSubmitApps>hadoop</aclSubmitApps>
            <aclAdministerApps>hadoop</aclAdministerApps>
        </queue>
        <!-- CPU and memory settings for queue queue_1024_01 -->
        <queue name="queue_1024_01">
            <maxRunningApps>10</maxRunningApps>
            <minResources>1024mb,1vcores</minResources>
            <maxResources>4096mb,3vcores</maxResources>
            <schedulingPolicy>fair</schedulingPolicy>
            <weight>1.0</weight>
            <aclSubmitApps>hadoop</aclSubmitApps>
            <aclAdministerApps>hadoop</aclAdministerApps>
        </queue>
    </queue>

    <fairSharePreemptionTimeout>600000</fairSharePreemptionTimeout>
    <defaultMinSharePreemptionTimeout>600000</defaultMinSharePreemptionTimeout>
</allocations>


Note that in Hadoop 2 the file listing the DataNode hosts was called slaves; in Hadoop 3 it has been renamed to workers.
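Using the hostnames from this post's configuration (dn1, dn2, dn3 as the DataNode hosts, an assumption based on the JournalNode and ZooKeeper settings above), the $HADOOP_CONF_DIR/workers file would simply list one host per line:

dn1
dn2
dn3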

3. Starting Hadoop 3
The first time Hadoop 3 is started, the HA state must be initialized in ZooKeeper and the NameNode formatted. The steps are as follows:

1. Start the JournalNode processes (used by QJM; Hadoop 3 also accepts the newer form hdfs --daemon start journalnode)

hadoop-daemon.sh start journalnode

2. Initialize the HA state in ZooKeeper

hdfs zkfc -formatZK

3. Format the NameNode

hdfs namenode -format

4. Start the NameNode

hadoop-daemon.sh start namenode

5. Sync the metadata on the Standby node

hdfs namenode -bootstrapStandby

6. Start HDFS and YARN

start-dfs.sh
start-yarn.sh

7. Start the historyserver (in Hadoop 3 the proxyserver is already started by YARN's startup scripts)

mr-jobhistory-daemon.sh start historyserver
4. Submitting a test job
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar ships with example programs. To verify the cluster with the WordCount example, proceed as follows:


1. Prepare the input data

vi /tmp/wc

a a
c s

2. Upload it to HDFS

hdfs dfs -put /tmp/wc /tmp

3. Submit the WordCount job

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar wordcount /tmp/wc /tmp/res

4. View the result

hdfs dfs -cat /tmp/res/part-r-00000
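For this tiny input, the expected contents of part-r-00000 can be sanity-checked with a quick sketch of the WordCount logic (plain Python, used here purely for illustration):

```python
from collections import Counter

# The sample input written to /tmp/wc above.
lines = ["a a", "c s"]

# Map: split each line into words; Reduce: count occurrences per word.
counts = Counter(word for line in lines for word in line.split())

# The job emits keys in sorted order, one "word<TAB>count" pair per line,
# matching what `hdfs dfs -cat /tmp/res/part-r-00000` should show.
for word in sorted(counts):
    print(f"{word}\t{counts[word]}")
```

This prints a count of 2 for "a" and 1 each for "c" and "s".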
5. Preview
5.1 HDFS web UI result

5.2 YARN web UI result

5.3 Queue page result

6. Summary
When compiling the Hadoop-3.2.0 source code, pay attention to the Maven remote repository configuration. Builds commonly fail because the remote repository is unreachable, so dependency JARs cannot be downloaded and compilation cannot proceed. Configuring a reachable Maven remote repository in Maven's settings.xml file solves this.
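As an example, a mirror entry in ~/.m2/settings.xml might look like the following (the Aliyun public mirror is used here as one commonly reachable option; substitute whatever mirror works in your network):

<mirrors>
  <mirror>
    <id>aliyun-public</id>
    <mirrorOf>central</mirrorOf>
    <name>Aliyun public mirror of Maven Central</name>
    <url>https://maven.aliyun.com/repository/public</url>
  </mirror>
</mirrors>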

7. Closing remarks
That is all I have to share in this post. If you run into problems while studying this material, you can join the discussion group or send me an email, and I will do my best to answer. Good luck!

The author has also published the books "Kafka 并不难学" (Kafka Isn't Hard to Learn) and "Hadoop 大数据挖掘从入门到进阶实战" (Hadoop Big Data Mining: From Beginner to Advanced Practice); readers who are interested can find the purchase links in the announcement bar. Thank you for your support. Follow the official account below and, following the prompts, get the books' instructional videos for free.
Original post: https://www.cnblogs.com/smartloli/p/10753998.html
