Hadoop Installation
Dependencies
Machine environment configuration
~/.bashrc
All of the settings here just define environment variables.
export HADOOP_PREFIX=/opt/hadoop    # user-defined; here the tarball is simply extracted under /opt
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX    # hadoop common
export HADOOP_HDFS_HOME=$HADOOP_PREFIX      # hdfs
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX    # mapreduce
export HADOOP_YARN_HOME=$HADOOP_PREFIX      # YARN
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_COMMON_LIB_NATIVE_DIR"    # no space is allowed inside `-Djava...`
export CLASSPATH=$CLASSPATH:$HADOOP_PREFIX/lib/*
export PATH=$PATH:$HADOOP_PREFIX/sbin:$HADOOP_PREFIX/bin
/etc/hosts
192.168.31.129 hd-master
192.168.31.130 hd-slave1
192.168.31.131 hd-slave2
127.0.0.1 localhost
The IP addresses configured here are those of the individual hosts and must be adapted to your own environment; the hd-master, hd-slave1, ... entries are simply IP-to-hostname mappings.
todo: does each host's name also have to be set in /etc/hostname?
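A sketch of that step on a systemd-based distribution (which the firewalld/systemctl commands below assume); run the matching command on every node:
$ sudo hostnamectl set-hostname hd-master
# on the master; use hd-slave1 / hd-slave2 on the workers
$ hostname
# verify the change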
firewalld
The firewall must be stopped and disabled on every node.
$ sudo systemctl stop firewalld.service
$ sudo systemctl disable firewalld.service
Directory creation
$ mkdir tmp
$ mkdir -p hdfs/data hdfs/name
Hadoop configuration
Configuration of the whole Hadoop family (including Hive, Tez, etc.) is driven by two kinds of configuration files: XML parameter files and environment-setting scripts.
A Hadoop cluster can run in three modes:
Standalone Operation
Pseudo-Distributed Operation
Fully-Distributed Operation
Each mode calls for a different configuration.
Standalone Operation
Hadoop is configured to run as a single, non-distributed Java process, which is useful for debugging.
Test
$ cd /path/to/hadoop
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar grep input output 'dfs[a-z.]+'
$ cat output/*
Pseudo-Distributed Operation
Hadoop runs on a single node in so-called pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process.
core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
Fully-Distributed Operation
After Hadoop has been configured on one node, the configuration has to be synchronized to the remaining nodes.
core-site.xml
Template: core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hd-master:9000</value>
        <description>namenode address</description>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:///opt/hadoop/tmp</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131702</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>
</configuration>
hdfs-site.xml
Template: hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hd-master:9001</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///opt/hadoop/hdfs/name</value>
        <description>namenode data directory</description>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///opt/hadoop/hdfs/data</value>
        <description>datanode data directory</description>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>replication number</description>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.datanode.directoryscan.throttle.limit.ms.per.sec</name>
        <value>1000</value>
    </property>
</configuration>
yarn-site.xml
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hd-master</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>hd-master:9032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>hd-master:9030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>hd-master:9031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>hd-master:9033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>hd-master:9099</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>512</value>
        <description>maximum memory allocation per container</description>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>256</value>
        <description>minimum memory allocation per container</description>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>1024</value>
        <description>maximum memory allocation per node</description>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>8</value>
        <description>virtual memory ratio</description>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>384</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.command-opts</name>
        <value>-Xms128m -Xmx256m</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>1</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>
mapred-site.xml
Template: mapred-site.xml.template
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hd-master:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hd-master:19888</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>256</value>
        <description>memory allocation for a map task, which should lie between the container minimum and maximum</description>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>256</value>
        <description>memory allocation for a reduce task, which should lie between the container minimum and maximum</description>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xms128m -Xmx256m</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xms128m -Xmx256m</value>
    </property>
</configuration>
Parameter notes
yarn.scheduler.minimum-allocation-mb: the container memory allocation unit, and also the smallest amount of memory a container can be allocated.
yarn.scheduler.maximum-allocation-mb: the largest container memory allocation; it should be an integer multiple of the minimum.
mapreduce.map.memory.mb: memory allocated to a map task. In Hadoop 2.x MapReduce is built on top of YARN and resources are managed by YARN, so a map task's memory should fall between the container minimum and maximum; otherwise it is given one allocation unit, i.e. a minimum-sized container.
mapreduce.reduce.memory.mb: memory allocated to a reduce task.
*.java.opts: JVM options; every container (in which a task executes) runs a JVM process.
-Xmx...m: maximum heap size. It should be smaller than the container memory allocated to the task (map or reduce), otherwise the task overruns its physical memory limit.
-Xms...m: initial heap size? #todo
yarn.nodemanager.vmem-pmem-ratio: virtual-to-physical memory ratio. All of the settings above are scaled by this factor, which is why the logs distinguish physical memory from virtual memory.
yarn.nodemanager.resource.memory-mb: memory available to containers on each node.
yarn.app.mapreduce.am.resource.mb: memory allocated to each Application Master.
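As a quick sanity check, the values in the templates above fit together as follows (plain arithmetic, not additional configuration):
map/reduce container request: 256 MB, which lies between the 256 MB scheduler minimum and the 512 MB maximum
ApplicationMaster request: 384 MB, also below the 512 MB maximum
JVM heap (-Xmx256m) does not exceed the 256 MB container allocation
virtual memory limit per container: 256 MB * 8 (vmem-pmem-ratio) = 2048 MB
containers per NodeManager, upper bound: 1024 MB / 256 MB = 4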
Master/worker list files
masters
slaves
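A minimal sketch of these two files, using the hostnames from /etc/hosts above (in Hadoop 3.x the worker list is named workers rather than slaves; masters, where used, conventionally names the SecondaryNameNode host):
$ cat > $HADOOP_CONF_DIR/slaves <<'EOF'
hd-slave1
hd-slave2
EOF
$ echo hd-master > $HADOOP_CONF_DIR/masters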
Environment-setting files
The settings here are only supplementary: variables already set in ~/.bashrc do not need to be set again. However, setting them here and then syncing the whole directory to the other nodes guarantees that the same environment variables are in place on every node.
hadoop-env.sh
Set JAVA_HOME to the Java installation root.
hdfs-env.sh
Set JAVA_HOME to the Java installation root.
yarn-env.sh
Set JAVA_HOME to the Java installation root.
JAVA_HOME=/opt/java/jdk
JAVA_HEAP_MAX=-Xmx3072m
Initialization, startup, and testing
HDFS
Format and start
$ hdfs namenode -format
# format the filesystem
$ start-dfs.sh
# start the NameNode and DataNodes
# the NameNode web UI is now reachable, by default at http://localhost:9870/
$ stop-dfs.sh
Test
$ hdfs dfsadmin -report
# should report all 3 nodes
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/<username>
# create the HDFS directories needed to run MapReduce jobs
$ hdfs dfs -mkdir input
$ hdfs dfs -put etc/hadoop/*.xml input
# copy files into the distributed filesystem
$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar grep input output 'dfs[a-z]+'
# run one of the bundled examples
# the example jar name depends on the version
$ hdfs dfs -get output output
$ cat output/*
# check the output: copy the output files from the distributed
# filesystem to the local filesystem and inspect them
$ hdfs dfs -cat output/*
# or inspect the output files directly on the distributed filesystem
$ hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.7.jar \
    -input /path/to/hdfs_file \
    -output /path/to/hdfs_dir \
    -mapper "/bin/cat" \
    -reducer "/usr/bin/wc" \
    -file /path/to/local_file \
    -numReduceTasks 1
YARN
$ sbin/start-yarn.sh
# start the ResourceManager and NodeManager daemons
# the ResourceManager web UI is now reachable, by default at http://localhost:8088/
$ sbin/stop-yarn.sh
# stop the daemons
Miscellaneous
Notes
Possible errors
Not all nodes start
Files cannot be written
could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
Causes
the firewall was not disabled
not enough storage space
node state is inconsistent or not all nodes started; the logs may even contain an ERROR about a connection timing out after 1000 ms
Handling
Stop the services, delete the data directory (dfs/data), and re-format the namenode.
Alternatively, try editing the node state files (VERSION) so that they agree; they live under:
${hadoop.tmp.dir}
${dfs.namenode.name.dir}
${dfs.datanode.data.dir}
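A sketch of checking whether the nodes agree, using the directory layout from hdfs-site.xml above; the clusterID field is the one that typically diverges after a re-format:
grep clusterID /opt/hadoop/hdfs/name/current/VERSION
grep clusterID /opt/hadoop/hdfs/data/current/VERSION
ssh centos2 grep clusterID /opt/hadoop/hdfs/data/current/VERSION
ssh centos3 grep clusterID /opt/hadoop/hdfs/data/current/VERSION
# if the values differ, either copy the namenode's clusterID into each datanode's
# VERSION file or wipe the data directories and re-format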
Unhealthy Node
1/1 local-dirs are bad: /opt/hadoop/tmp/nm-local-dir; 1/1 log-dirs are bad: /opt/hadoop/logs/userlogs
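This usually means the NodeManager disk health checker flagged the directories, most often because disk usage crossed its threshold (90% by default) or because of permissions; a sketch of the usual checks (the yarn-site.xml property named below is the standard disk-health-checker knob):
df -h /opt/hadoop
# check free space on the disk holding the NodeManager local/log dirs
ls -ld /opt/hadoop/tmp/nm-local-dir /opt/hadoop/logs/userlogs
# check ownership and permissions
# if it is a space issue, free up space or raise
# yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage in yarn-site.xml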
Common commands
scp -r /opt/hadoop/etc/hadoop centos2:/opt/hadoop/etc
scp -r /opt/hadoop/etc/hadoop centos3:/opt/hadoop/etc
# sync the configuration
scp /root/.bashrc centos2:/root
scp /root/.bashrc centos3:/root
# sync the environment
rm -r /opt/hadoop/tmp /opt/hadoop/hdfs
mkdir -p /opt/hadoop/tmp /opt/hadoop/hdfs/name /opt/hadoop/hdfs/data
ssh centos2 rm -r /opt/hadoop/tmp /opt/hadoop/hdfs
ssh centos2 mkdir -p /opt/hadoop/tmp /opt/hadoop/hdfs/name /opt/hadoop/hdfs/data
ssh centos3 rm -r /opt/hadoop/tmp /opt/hadoop/hdfs
ssh centos3 mkdir -p /opt/hadoop/tmp /opt/hadoop/hdfs/name /opt/hadoop/hdfs/data
# wipe the data on all nodes
rm -r /opt/hadoop/logs/*
ssh centos2 rm -r /opt/hadoop/logs/*
ssh centos3 rm -r /opt/hadoop/logs/*
# wipe the logs on all nodes
Hive
Dependencies
hadoop: once Hadoop is configured, Java etc. are configured as well
a relational database: MySQL, Derby, etc.
Machine environment configuration
~/.bashrc
export HIVE_HOME=/opt/hive    # user-defined
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*
Directory creation
HDFS
$ hdfs dfs -rm -r /user/hive
$ hdfs dfs -mkdir -p /user/hive/warehouse /user/hive/tmp /user/hive/logs
# these three directories correspond to entries in the configuration files
$ hdfs dfs -chmod 777 /user/hive/warehouse /user/hive/tmp /user/hive/logs
Local FS
$ mkdir data
$ chmod 777 data
# hive data storage directory
$ mkdir logs
$ chmod 777 logs
# log directory
Hive configuration
XML parameters
conf/hive-site.xml
Template: conf/hive-default.xml.template
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hd-master:3306/metastore_db?createDatabaseIfNotExist=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.mariadb.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>1234</value>
</property>
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
</property>
<property>
    <name>hive.exec.scratchdir</name>
    <value>/user/hive/tmp</value>
</property>
<property>
    <name>system:java.io.tmpdir</name>
    <value>/opt/hive/tmp</value>
</property>
<property>
    <name>system:user.name</name>
    <value>hive</value>
</property>
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.31.129:19083</value>
</property>
<property>
    <name>hive.server2.logging.operation.enabled</name>
    <value>true</value>
</property>
Paths beginning with /user generally refer to HDFS, while paths beginning with a ${} variable generally refer to the local filesystem.
The variables system:java.io.tmpdir and system:user.name have to be defined in the file yourself; doing so avoids editing by hand every place where these variables appear.
hive.querylog.location is better pointed at a local path. This log seems to exist only while Hive is running: it is only the query log, not Hive's run log, and it is deleted when Hive exits; it is not that no log was ever generated, nor that the ${}-prefixed location is an HDFS path.
Some of the directories appearing in the configuration (HDFS and local) have to be created manually.
If hive.metastore.uris is configured, Hive accesses metadata through the metastore service; the metastore service must be started before using Hive, and on the same port as in the configuration file, otherwise Hive cannot reach it.
Environment-setting file
conf/hive-env.sh
Template: conf/hive-env.sh.template
export JAVA_HOME=/opt/java/jdk
export HADOOP_HOME=/opt/hadoop
export HIVE_CONF_DIR=/opt/hive/conf
# the three variables above need not be set again if they are already set in `~/.bashrc`
export HIVE_AUX_JARS_PATH=/opt/hive/lib
conf/hive-exec-log4j2.properties
conf/hive-log4j2.properties
Template: hive-log4j2.properties.template
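Neither file is strictly required; a sketch of creating them from the templates and pointing Hive's log directory at the local logs directory created above (assumed to be /opt/hive/logs; property.hive.log.dir is the key used in the stock template):
$ cd /opt/hive/conf
$ cp hive-log4j2.properties.template hive-log4j2.properties
$ cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties
$ sed -i 's|^property.hive.log.dir.*|property.hive.log.dir = /opt/hive/logs|' hive-log4j2.properties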
MariaDB
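The hive-site.xml above assumes a MariaDB instance on hd-master with a hive account; a minimal sketch of preparing it (user hive and password 1234 match the configuration; the JDBC URL already carries createDatabaseIfNotExist=true, so creating the database up front is optional):
$ mysql -u root -p
MariaDB> CREATE DATABASE IF NOT EXISTS metastore_db;
MariaDB> CREATE USER 'hive'@'%' IDENTIFIED BY '1234';
MariaDB> GRANT ALL PRIVILEGES ON metastore_db.* TO 'hive'@'%';
MariaDB> FLUSH PRIVILEGES;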
Initialize the database
$ schematool -initSchema -dbType mysql
This command must be run after all of the configuration is complete.
Service setup
$ hive --service metastore -p 19083 &
# start the metastore service; the port must match the Hive configuration,
# otherwise Hive cannot connect to the metastore service and cannot be used
# the metastore service can only be stopped by killing its process
$ hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10011 &
# start the JDBC server
# a JDBC client (such as beeline) can now connect to it and operate on
# the data in Hive
$ hive --service hiveserver2 --stop
# stop the JDBC server
# or simply kill it
Testing
Hive availability
HDFS, YARN, and the metastore database (MySQL) must be started first; if a standalone metastore server is configured, it also has to be started on the correct port.
hive> create table if not exists words(id INT, word STRING)
      row format delimited fields terminated by " "
      lines terminated by "\n";
hive> load data local inpath "/opt/hive-test.txt" overwrite into table words;
hive> select * from words;
JDBC server availability
Connect from the command line
$ beeline -u jdbc:hive2://localhost:10011 -n hive -p 1234
Connect from within beeline
$ beeline
beeline> !connect jdbc:hive2://localhost:10011
# then enter the user name and password (the metastore database credentials)
Miscellaneous
Possible errors
Failed with exception Unable to move source file
Linux permission problem: the source file cannot be accessed
HDFS permission problem: the target file cannot be written
HDFS misconfiguration: nothing can be written to HDFS at all; see the HDFS issues above
org.apache.hive.service.cli.HiveSQLException: Couldn't find log associated with operation handle:
User: root is not allowed to impersonate hive
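The impersonation error is governed by the hadoop.proxyuser.root.* entries already shown in core-site.xml (they must match the user running HiveServer2); after changing them, a sketch of reloading the proxyuser settings without a full restart (standard HDFS/YARN admin commands):
$ hdfs dfsadmin -refreshSuperUserGroupsConfiguration
$ yarn rmadmin -refreshSuperUserGroupsConfiguration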
Tez
Dependencies
Machine environment configuration
.bashrc
export TEZ_HOME=/opt/tez
export TEZ_CONF_DIR=$TEZ_HOME/conf
for jar in `ls $TEZ_HOME | grep jar`; do
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$TEZ_HOME/$jar
done
for jar in `ls $TEZ_HOME/lib`; do
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$TEZ_HOME/lib/$jar
done
# this part could be replaced with the line below
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$TEZ_HOME/*:$TEZ_HOME/lib/*
# `hadoop-env.sh` describes `HADOOP_CLASSPATH` as extra Java CLASSPATH elements,
# which means a Hadoop component only needs to add its jars to `HADOOP_CLASSPATH`
HDFS
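tez.lib.uris in tez-site.xml below points at an archive on HDFS, so the Tez archive has to be uploaded first; a sketch, assuming the release archive is available locally as /opt/tez/share/tez.tar.gz (the exact tarball name and location vary by Tez version):
$ hdfs dfs -mkdir -p /apps
$ hdfs dfs -put /opt/tez/share/tez.tar.gz /apps/tez.tar.gz
$ hdfs dfs -ls /apps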
Hadoop on Tez
Configuring Tez inside Hadoop.
XML parameters
tez-site.xml
Template: conf/tez-default-template.xml
It apparently still needs to be copied into Hadoop's configuration directory.
<property>
    <name>tez.lib.uris</name>
    <value>${fs.defaultFS}/apps/tez.tar.gz</value>
    <!-- location of the tez archive -->
</property>
<!--
<property>
    <name>tez.container.max.java.heap.fraction</name>
    <value>0.2</value>
</property>
when memory is insufficient
-->
mapred-site.xml
Edit mapred-site.xml to run MapReduce on yarn-tez
(this change is also mentioned in the Hadoop section).
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn-tez</value>
</property>
Environment parameters
Hive on Tez
Hive settings
If mapred-site.xml has already been changed to run everything on Tez globally, there is no need to copy any jars; editing hive-site.xml is enough.
Jar copying
Copy the jars under $TEZ_HOME and $TEZ_HOME/lib into $HIVE_HOME/lib.
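A sketch of that copy step, using the paths from this section:
$ cp $TEZ_HOME/*.jar $TEZ_HOME/lib/*.jar $HIVE_HOME/lib/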
hive-site.xml
<property>
    <name>hive.execution.engine</name>
    <value>tez</value>
</property>
Miscellaneous
Possible errors
SLF4J: Class path contains multiple SLF4J bindings.
Spark
Dependencies
java
scala
python: usually Anaconda, which needs extra configuration
export PYTHON_HOME=/opt/anaconda3
export PATH=$PYTHON_HOME/bin:$PATH
a resource-management framework, if not running in standalone mode
Machine environment configuration
~/.bashrc
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/*
# add the zip files of the `pyspark` and `py4j` modules to the path,
# otherwise Spark cannot be driven from a regular Python interpreter
# the `*` wildcard used here should work; adding each zip manually certainly does
# apparently only the master node needs `/lib/*` added for `pyspark`/`py4j`
Standalone
Environment-setting file
conf/spark-env.sh
Template: conf/spark-env.sh.template
Some of these settings can probably be omitted or removed. #todo
export JAVA_HOME=/opt/jdk
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export HIVE_HOME=/opt/hive
export SCALA_HOME=/opt/scala
export SCALA_LIBRARY=$SPARK_HOME/lib
# once `~/.bashrc` is set up, this should be the only line of the block above that still needs setting
export SPARK_HOME=/opt/spark
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# the classpath here is obtained by running the command  # todo
# according to the docs this seems similar to `$HADOOP_CLASSPATH`,
# i.e. it could probably be appended to `$CLASSPATH` directly instead of setting this variable
export SPARK_LIBRARY_PATH=$SPARK_HOME/lib
export SPARK_MASTER_HOST=hd-master
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_WEBUI_PORT=8081
export SPARK_WORKER_MEMORY=1024m
# spark can run several tasks inside one container
export SPARK_LOCAL_DIRS=$SPARK_HOME/data
# has to be created manually
export SPARK_MASTER_OPTS=
export SPARK_WORKER_OPTS=
export SPARK_DAEMON_JAVA_OPTS=
export SPARK_DAEMON_MEMORY=
Directory creation
$ mkdir /opt/spark/spark_data
# for `$SPARK_LOCAL_DIRS`
Spark configuration
conf/slaves
If this file does not exist, Spark runs on the current host only, as a single node.
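A minimal sketch of the file, one worker hostname per line, using the hostnames from /etc/hosts (newer Spark releases name this file conf/workers instead):
$ cat > $SPARK_HOME/conf/slaves <<'EOF'
hd-slave1
hd-slave2
EOF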
conf/hive-site.xml
This only configures Spark so that, as a thrift client, it can connect to the metastore server correctly.
Template: /opt/hive/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://192.168.31.129:19083</value>
        <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore</description>
    </property>
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10011</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>hd-master</value>
    </property>
</configuration>
Testing
Starting the Spark services
HDFS and a metastore server listening on the correct port must be started first.
$ start-master.sh
# start a master instance on the machine the command is run on
$ start-slaves.sh
# start a worker instance on every machine listed in `conf/slaves`
$ start-slave.sh
# start a worker instance on the machine the command is run on
$ stop-master.sh
$ stop-slaves.sh
$ stop-slave.sh
Starting the Spark Thrift Server
$ start-thriftserver.sh --master spark://hd-master:7077 \
    --hiveconf hive.server2.thrift.bind.host=hd-master \
    --hiveconf hive.server2.thrift.port=10011
# the thrift server host and port are given on the command line here;
# if they are configured in `conf/hive-site.xml`, this should not be necessary
# then connect to the thrift server with beeline, the same as with Hive
Spark SQL test
$ spark-sql
# when started on a node that has the configuration files, `MASTER` is already
# specified there, so no further options are needed
spark-sql> set spark.sql.shuffle.partitions=20;
spark-sql> select id, count(*) from words group by id order by id;
pyspark test
$ MASTER=spark://hd-master:7077 pyspark
from pyspark.sql import HiveContext
sql_ctxt = HiveContext(sc)
ret = sql_ctxt.sql("show tables").collect()
file = sc.textFile("hdfs://hd-master:9000/user/root/input/capacity-scheduler.xml")
file.count()
file.first()
Scala test
$ MASTER=spark://hd-master:7077 spark-shell \
    --executor-memory 1024m \
    --total-executor-cores 2 \
    --executor-cores 1
# start `spark-shell` with these options
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from words").collect().foreach(println)
sqlContext.sql("select id, word from words order by id").collect().foreach(println)
sqlContext.sql("insert into words values(7, \"jd\")")
val df = sqlContext.sql("select * from words")
df.show()
var df = spark.read.json("file:///opt/spark/examples/src/main/resources/people.json")
df.show()
Spark on YARN
Miscellaneous
Possible errors
Initial job has not accepted any resources;
ERROR KeyProviderCache:87 - Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider
HBase
Dependencies
java
hadoop
zookeeper: recommended, otherwise the logs are hard to manage
Machine environment
~/.bashrc
export HBASE_HOME=/opt/hbase
export PATH=$PATH:$HBASE_HOME/bin
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HBASE_HOME/lib/*
Directory creation
$ mkdir -p /tmp/hbase/tmpdir
HBase configuration
Environment variables
conf/hbase-env.sh
export HBASE_MANAGES_ZK=false
# do not use the bundled zookeeper
conf/zoo.cfg
If HBase is set to use a standalone zookeeper, the zookeeper configuration has to be copied into HBase's configuration directory.
$ cp /opt/zookeeper/conf/zoo.cfg /opt/hbase/conf
Standalone mode
conf/hbase-site.xml
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>file://${HBASE_HOME}/data</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/tmp/zookeeper/zkdata</value>
    </property>
</configuration>
Pseudo-Distributed mode
conf/hbase-site.xml
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://hd-master:9000/hbase</value>
</property>
Fully-Distributed mode
conf/hbase-site.xml
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://hd-master:9000/hbase</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>hd-master,hd-slave1,hd-slave2</value>
</property>
<property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/tmp/zookeeper/zkdata</value>
</property>
Testing
HDFS and YARN must be started first.
With a standalone zookeeper, zookeeper also has to be started on every node beforehand.
$ start-hbase.sh
# start the HBase services
$ local-regionservers.sh start 2 3 4 5
# start 4 extra RegionServers
$ hbase shell
hbase> create 'test', 'cf'
hbase> list 'test'
hbase> put 'test', 'row7', 'cf:a', 'value7a'
hbase> put 'test', 'row7', 'cf:b', 'value7b'
hbase> put 'test', 'row7', 'cf:c', 'value7c'
hbase> put 'test', 'row8', 'cf:b', 'value8b'
hbase> put 'test', 'row9', 'cf:c', 'value9c'
hbase> scan 'test'
hbase> get 'test', 'row7'
hbase> disable 'test'
hbase> enable 'test'
hbase> drop 'test'
hbase> quit
Zookeeper
Dependencies
Machine environment
~/.bashrc
export ZOOKEEPER_HOME=/opt/zookeeper
export PATH=$PATH:$ZOOKEEPER_HOME/bin
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$ZOOKEEPER_HOME/lib
Directory creation
mkdir -p /tmp/zookeeper/zkdata /tmp/zookeeper/zkdatalog
echo 0 > /tmp/zookeeper/zkdata/myid
ssh centos2 mkdir -p /tmp/zookeeper/zkdata /tmp/zookeeper/zkdatalog
ssh centos3 mkdir -p /tmp/zookeeper/zkdata /tmp/zookeeper/zkdatalog
ssh centos2 "echo 1 > /tmp/zookeeper/zkdata/myid"
ssh centos3 "echo 2 > /tmp/zookeeper/zkdata/myid"
# each server's myid must match its server.N entry in conf/zoo.cfg
Zookeeper configuration
Conf
conf/zoo.cfg
tickTime=2000
# the number of milliseconds of each tick
initLimit=10
# the number of ticks that the initial synchronization phase can take
syncLimit=5
# the number of ticks that can pass between sending a request and getting an acknowledgement
dataDir=/tmp/zookeeper/zkdata
dataLogDir=/tmp/zookeeper/zkdatalog
# the directory where the snapshot is stored
# do not use /tmp for storage; /tmp here is just an example
clientPort=2181
# the port at which the clients will connect
autopurge.snapRetainCount=3
# the number of snapshots to retain in dataDir
# be sure to read the maintenance section of the administrator guide before turning on autopurge
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
autopurge.purgeInterval=1
# purge task interval in hours; set to "0" to disable the auto purge feature
server.0=hd-master:2888:3888
server.1=hd-slave1:2888:3888
server.2=hd-slave2:2888:3888
# the zookeeper servers, format: server.N=HOST:PORT1:PORT2
# PORT1: port used to communicate with the leader
# PORT2: port used to re-elect a leader when the current leader fails
$dataDir/myid
$dataDir is the directory specified in conf/zoo.cfg.
The myid file contains a single id identifying the current zookeeper server; the server reads this file at startup to determine its id, and the file has to be created by hand.
Startup, testing, cleanup
Start zookeeper
$ zkServer.sh start
# start the zookeeper service
# the service has to be started manually on each node
$ zkServer.sh status
# check the service status
$ zkCleanup.sh
# clean up old snapshots and log files
Flume
Dependencies
Machine environment configuration
~/.bashrc
export PATH=$PATH:/opt/flume/bin
Flume configuration
Environment-setting file
conf/flume-env.sh
Template: conf/flume-env.sh.template
Conf file
conf/flume.conf
Template: conf/flume-conf.properties.template
agent1.channels.ch1.type=memory
# define a memory channel called `ch1` on `agent1`
agent1.sources.avro-source1.channels=ch1
agent1.sources.avro-source1.type=avro
agent1.sources.avro-source1.bind=0.0.0.0
agent1.sources.avro-source1.port=41414
# define an Avro source called `avro-source1` on `agent1` and tell it to bind to 0.0.0.0:41414
agent1.sinks.log-sink1.channel=ch1
agent1.sinks.log-sink1.type=logger
# define a logger sink that simply logs all events it receives
agent1.channels=ch1
agent1.sources=avro-source1
agent1.sinks=log-sink1
# finally, now that all components are defined, tell `agent1` which ones to activate
Startup and testing
$ flume-ng agent --conf /opt/flume/conf \
    -f /opt/flume/conf/flume.conf \
    -Dflume.root.logger=DEBUG,console \
    -n agent1
# the agent name given by `-n agent1` must match an agent name in the file passed to `-f`
$ flume-ng avro-client --conf /opt/flume/conf \
    -H localhost -p 41414 \
    -F /opt/hive-test.txt \
    -Dflume.root.logger=DEBUG,console
# test flume
Miscellaneous
Kafka
Dependencies
Machine environment variables
~/.bashrc
export PATH=$PATH:/opt/kafka/bin
export KAFKA_HOME=/opt/kafka
Multi-broker configuration
Conf
config/server-1.properties
Template: config/server.properties
broker.id must differ between nodes.
Several configuration files can be written, and brokers on different nodes started with different files (see the sketch below).
broker.id=0
listeners=PLAINTEXT://:9093
zookeeper.connect=hd-master:2181,hd-slave1:2181,hd-slave2:2181
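A sketch of deriving extra broker configurations from the template on one node (broker ids and ports here are illustrative; brokers on separate machines only need distinct broker.id, while brokers sharing a machine also need distinct listener ports and log.dirs):
$ cd /opt/kafka/config
$ cp server.properties server-1.properties
$ cp server.properties server-2.properties
$ sed -i 's/^broker.id=.*/broker.id=1/' server-1.properties
$ sed -i 's/^broker.id=.*/broker.id=2/' server-2.properties
$ sed -i 's|^#\?listeners=.*|listeners=PLAINTEXT://:9093|' server-1.properties
$ sed -i 's|^#\?listeners=.*|listeners=PLAINTEXT://:9094|' server-2.properties
$ sed -i 's|^log.dirs=.*|log.dirs=/tmp/kafka-logs-1|' server-1.properties
$ sed -i 's|^log.dirs=.*|log.dirs=/tmp/kafka-logs-2|' server-2.properties
$ kafka-server-start.sh server-1.properties &
$ kafka-server-start.sh server-2.properties &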
Testing
$ kafka-server-start.sh /opt/kafka/config/server.properties &
# start the kafka service (a broker)
# this starts a single broker with the default configuration file
# starting several brokers requires running the command once per configuration file
$ kafka-server-stop.sh
$ kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 \
    --partitions 1 \
    --topic test1
# create a topic
$ kafka-topics.sh --list --zookeeper localhost:2181
$ kafka-topics.sh --delete --zookeeper localhost:2181 --topic test1
# delete the topic
$ kafka-console-producer.sh --broker-list localhost:9092 \
    --topic test1
# in a new terminal, start a producer and begin sending messages
$ kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic test1 \
    --from-beginning
# in a new terminal, start a consumer and begin receiving messages
$ kafka-console-consumer.sh --zookeeper localhost:2181 \
    --topic test1 \
    --from-beginning
# this older zookeeper-based consumer form appears to be wrong/obsolete
Miscellaneous
Storm
Dependencies
java
zookeeper
python 2.6+
ZeroMQ, JZMQ
Machine environment configuration
~/.bashrc
export STORM_HOME=/opt/storm
export PATH=$PATH:$STORM_HOME/bin
Storm configuration
Configuration file
conf/storm.yaml
storm.zookeeper.servers:
    - hd-master
    - hd-slave1
    - hd-slave2
storm.zookeeper.port: 2181
nimbus.seeds: [hd-master]
storm.local.dir: /tmp/storm/tmp
nimbus.host: hd-master
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
Startup and testing
storm nimbus &> /dev/null &
storm logviewer &> /dev/null &
storm ui &> /dev/null &
# start nimbus on the master node
storm supervisor &> /dev/null &
storm logviewer &> /dev/null &
# start on the worker nodes
storm jar /opt/storm/example/..../storm-start.jar \
    storm.starter.WordCountTopology
# test case
storm kill WordCountTopology
http://hadoop.apache.org/docs/r3.1.1