CDH 6 does not ship with spark-sql by default. For developers this hardly matters, and they are generally advised to rely on SQL statements as little as possible. For data analysts, however, Hive SQL is slow, and spark-sql is a much better fit.
For installing CDH 6, see: Cloudera CDH 6.3 installation and configuration.
I. Download native Apache Spark
# wget http://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
# tar zxvf spark-2.4.0-bin-hadoop2.7.tgz
# mv spark-2.4.0-bin-hadoop2.7 /bigdata/spark
Why choose spark-2.4.0-bin-hadoop2.7.tgz?
1. CDH 6.3.1 itself ships Spark 2.4.0.
2. As of this writing, the prebuilt Spark 2.4.0 packages only go up to Hadoop 2.7 (the latest stable build). CDH 6.3.1 runs Hadoop 3.0.0, but that does not stop Spark from working (a quick way to confirm the version is shown below).
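If you want to confirm which Hadoop version the cluster is actually running, the hadoop CLI will tell you; on CDH 6.3.1 the output typically starts like this (your exact build string may differ):

# hadoop version
Hadoop 3.0.0-cdh6.3.1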
II. Set environment variables
# vim ~/.bashrc
export HADOOP_HOME=/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export SPARK_CONF_DIR=/bigdata/spark/conf
export SPARK_HOME=/bigdata/spark
export PATH=$SPARK_HOME/bin:$PATH
export SPARK_HISTORY_OPTS="-Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://bigdata1/spark/logs"
# source ~/.bashrc
Once these variables are set, there is no need to configure spark-env.sh as well.
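A quick sanity check that the new variables are picked up (the paths below assume the layout used in this post):

# echo $SPARK_HOME
/bigdata/spark
# $SPARK_HOME/bin/spark-submit --version     # should report Spark 2.4.0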
III. Configure Spark
1. Install mysql-connector-java
# yum install mysql-connector-java
# cp /usr/share/java/mysql-connector-java.jar /bigdata/spark/jars/
2. Upload the Spark jars to HDFS
# hdfs dfs -mkdir /spark
# hdfs dfs -mkdir /spark/jars
# hdfs dfs -mkdir /spark/logs
# cd /bigdata/spark/jars
# hdfs dfs -put ./*.jar /spark/jars
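To confirm the upload succeeded, list what is now in HDFS (the exact jar count depends on the Spark build; spark-2.4.0-bin-hadoop2.7 ships a couple of hundred jars):

# hdfs dfs -count /spark/jars
# hdfs dfs -ls /spark/jars | head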
3. Configure spark-defaults.conf
# vim /bigdata/spark/conf/spark-defaults.conf
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://bigdata1/spark/logs
# mind the driver/executor memory sizes below (see the error further down)
spark.driver.memory 2g
spark.executor.memory 2g
spark.shuffle.service.enabled true
spark.shuffle.service.port 7337
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 6
spark.dynamicAllocation.schedulerBacklogTimeout 1s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5s
spark.submit.deployMode client
spark.yarn.jars hdfs://bigdata1/spark/jars/*
spark.serializer org.apache.spark.serializer.KryoSerializer
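With the configuration in place, a simple end-to-end test is the bundled SparkPi example (the jar path below matches the spark-2.4.0-bin-hadoop2.7 layout; adjust it if yours differs). A job like this is exactly what surfaces the memory error described next:

# spark-submit --class org.apache.spark.examples.SparkPi \
    /bigdata/spark/examples/jars/spark-examples_2.11-2.4.0.jar 100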
If you see the following error:
java.lang.IllegalArgumentException: Required executor memory (2048), overhead (384 MB), and PySpark memory (0 MB) is above the max threshold (1536 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
There are two ways to fix this. The executor memory you request plus its overhead (2048 + 384 = 2432 MB here) must fit within yarn.scheduler.maximum-allocation-mb, so you can adjust either side.

Option 1: lower the two Spark memory settings, for example:

spark.driver.memory 1g
spark.executor.memory 1g

Option 2: raise the two YARN parameters below:
# vim /etc/hadoop/conf/yarn-site.xml
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>34576</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>34576</value>
</property>
4. Configure hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://bigserver1:3306/metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>cdh6</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>Cdh6_123</value>
    </property>
    <property>
        <name>hive.exec.scratchdir</name>
        <value>/user/hive/tmp</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <property>
        <name>hive.querylog.location</name>
        <value>/user/hive/log</value>
    </property>
</configuration>
The hive-site.xml that ships with CDH 6 is not reused here; there is no need, since all Spark really requires is the metastore connection.
5. Edit slaves
# cat slaves
10.0.40.237 bigserver1
10.0.40.222 bigserver2
10.0.40.193 bigserver3
10.0.40.200 bigserver4
10.0.10.245 bigserver5
6. Remove the spark-submit (and related commands) that ship with CDH 6.3
# whereis spark-submit
spark-submit: /usr/bin/spark-submit /bigdata/spark/bin/spark-submit2.cmd /bigdata/spark/bin/spark-submit.cmd /bigdata/spark/bin/spark-submit
# rm -f /usr/bin/spark-submit
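Afterwards spark-submit should resolve to the native installation (thanks to the PATH set in section II):

# which spark-submit
/bigdata/spark/bin/spark-submit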
Repeat all of the steps above on every machine in the Spark cluster. The spark directory itself can simply be copied with scp, as sketched below, which saves quite a bit of time.
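A minimal sketch of pushing the directory to the other nodes, assuming bigserver1 is the machine everything was configured on and the hostnames match the slaves file above (the ~/.bashrc changes and the spark-submit cleanup from step 6 still need to be repeated on every node):

# for host in bigserver2 bigserver3 bigserver4 bigserver5; do scp -r /bigdata/spark ${host}:/bigdata/; done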
IV. Check the result
# ll /opt/cloudera/parcels/CDH/lib/spark/jars |grep spark-catalyst
lrwxrwxrwx 1 root root 52 Sep 26 18:55 spark-catalyst_2.11-2.4.0-cdh6.3.1.jar -> ../../../jars/spark-catalyst_2.11-2.4.0-cdh6.3.1.jar
The spark-catalyst jar bundled with CDH carries the cdh suffix, unlike the one in the native Spark installation.
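And the point of the whole exercise: spark-sql is now available and, with the settings above, runs on YARN against the Hive metastore. A minimal smoke test (the databases and tables listed will simply be whatever already exists in your metastore):

# spark-sql
spark-sql> show databases;
spark-sql> show tables;
spark-sql> quit;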
Please credit the source when reposting.
Author: 海底苍鹰
URL: http://blog.51yip.com/hadoop/2326.html