Using standalone Apache Spark with CDH 6

Posted by 张映 on 2019-12-27

Category: hadoop/spark/scala


CDH 6 ships without spark-sql by default. For development this hardly matters, and developers are better off keeping raw SQL to a minimum anyway. For data analysts, though, Hive SQL is slow, and spark-sql is a much better fit.

For installing CDH 6 itself, see: Cloudera CDH 6.3 installation and configuration.

I. Download vanilla Apache Spark

# wget http://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
# tar zxvf spark-2.4.0-bin-hadoop2.7.tgz
# mv spark-2.4.0-bin-hadoop2.7 /bigdata/spark

Why spark-2.4.0-bin-hadoop2.7.tgz?

1. CDH 6.3.1 itself ships Spark 2.4.0, so the versions match.

2. As of this writing, the newest stable Hadoop line that prebuilt Spark 2.4.0 targets is 2.7. CDH 6.3.1 runs Hadoop 3.0.0, but this mismatch does not affect Spark in practice.

II. Set environment variables

# vim ~/.bashrc

export HADOOP_HOME=/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export SPARK_CONF_DIR=/bigdata/spark/conf
export SPARK_HOME=/bigdata/spark
export PATH=$SPARK_HOME/bin:$PATH
export SPARK_HISTORY_OPTS="-Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://bigdata1/spark/logs"

# source ~/.bashrc

Once these variables are configured, spark-env.sh does not need to be edited at all.
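After sourcing ~/.bashrc, a quick sanity check helps catch a typo in any of the paths. A minimal sketch (`check_env` is a helper written for this post, not a Spark tool; the variable names are the ones exported above):

```shell
# Report any of the exported variables that did not actually get set.
check_env() {
  rc=0
  for v in HADOOP_HOME HADOOP_CONF_DIR YARN_CONF_DIR SPARK_CONF_DIR SPARK_HOME; do
    eval "val=\$$v"
    if [ -z "$val" ]; then
      echo "missing: $v"
      rc=1
    fi
  done
  return $rc
}

check_env && echo "environment OK"
```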

III. Configure Spark

1. Install mysql-connector-java

# yum install mysql-connector-java
# cp /usr/share/java/mysql-connector-java.jar /bigdata/spark/jars/

2. Upload the Spark jars to HDFS

# hdfs dfs -mkdir /spark
# hdfs dfs -mkdir /spark/jars
# hdfs dfs -mkdir /spark/logs
# cd /bigdata/spark/jars
# hdfs dfs -put ./*.jar /spark/jars

3. Edit spark-defaults.conf

# vim /bigdata/spark/conf/spark-defaults.conf

spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://bigdata1/spark/logs
# mind the memory sizes (see the YARN error below)
spark.driver.memory 2g
spark.executor.memory 2g

spark.shuffle.service.enabled true
spark.shuffle.service.port 7337
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 6
spark.dynamicAllocation.schedulerBacklogTimeout 1s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5s

spark.submit.deployMode client
spark.yarn.jars hdfs://bigdata1/spark/jars/*
spark.serializer org.apache.spark.serializer.KryoSerializer

If you then see the following error:

java.lang.IllegalArgumentException: Required executor memory (2048), overhead (384 MB), and PySpark memory (0 MB) is above the max threshold (1536 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.

there are two ways to fix it.

Option 1: lower the two memory settings:

spark.driver.memory 1g
spark.executor.memory 1g

Option 2: raise the two YARN limits:

# vim /etc/hadoop/conf/yarn-site.xml 

<property>
 <name>yarn.nodemanager.resource.memory-mb</name>
 <value>34576</value>
</property>
<property>
 <name>yarn.scheduler.maximum-allocation-mb</name>
 <value>34576</value>
</property>
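The numbers in that error are not arbitrary: Spark on YARN requests the executor memory plus an overhead of max(384 MB, 10% of executor memory), and the sum must stay under yarn.scheduler.maximum-allocation-mb. The arithmetic for the 2g setting above, as a sketch:

```shell
executor_mb=2048   # spark.executor.memory
yarn_max_mb=1536   # yarn.scheduler.maximum-allocation-mb before raising it

# Default overhead: 10% of executor memory, with a 384 MB floor.
overhead_mb=$(( executor_mb / 10 ))
if [ "$overhead_mb" -lt 384 ]; then overhead_mb=384; fi

request_mb=$(( executor_mb + overhead_mb ))
echo "container request: ${request_mb} MB"   # 2048 + 384 = 2432

if [ "$request_mb" -gt "$yarn_max_mb" ]; then
  echo "${request_mb} MB exceeds the ${yarn_max_mb} MB limit, hence the error"
fi
```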

4. Configure hive-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

 <property>
 <name>javax.jdo.option.ConnectionURL</name>
 <value>jdbc:mysql://bigserver1:3306/metastore?createDatabaseIfNotExist=true</value>
 </property>

 <property>
 <name>javax.jdo.option.ConnectionDriverName</name>
 <value>com.mysql.jdbc.Driver</value>
 </property>

 <property>
 <name>javax.jdo.option.ConnectionUserName</name>
 <value>cdh6</value>
 </property>

 <property>
 <name>javax.jdo.option.ConnectionPassword</name>
 <value>Cdh6_123</value>
 </property>

 <property>
 <name>hive.exec.scratchdir</name>
 <value>/user/hive/tmp</value>
 </property>

 <property>
 <name>hive.metastore.warehouse.dir</name>
 <value>/user/hive/warehouse</value>
 </property>

 <property>
 <name>hive.querylog.location</name>
 <value>/user/hive/log</value>
 </property>

</configuration>

This does not reuse the hive-site.xml that ships with CDH 6. There is no need: a metastore connection is really all Spark requires.
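Since this hive-site.xml is written by hand rather than copied from CDH, it is worth confirming it is well-formed before distributing it. A small sketch using Python's stdlib XML parser (`check_xml` is a helper written for this post; the path follows this post's layout):

```shell
# Fail loudly on any XML syntax error; print "well-formed" otherwise.
check_xml() {
  python3 -c 'import sys, xml.etree.ElementTree as ET; ET.parse(sys.argv[1]); print("well-formed")' "$1"
}

HIVE_SITE=/bigdata/spark/conf/hive-site.xml
if [ -f "$HIVE_SITE" ]; then check_xml "$HIVE_SITE"; fi
```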

5. Edit the slaves file

# cat slaves

10.0.40.237 bigserver1
10.0.40.222 bigserver2
10.0.40.193 bigserver3
10.0.40.200 bigserver4
10.0.10.245 bigserver5

6. Remove the spark-submit shipped with CDH 6.3

# whereis spark-submit
spark-submit: /usr/bin/spark-submit /bigdata/spark/bin/spark-submit2.cmd /bigdata/spark/bin/spark-submit.cmd /bigdata/spark/bin/spark-submit

# rm -f /usr/bin/spark-submit

Repeat all of the above on every machine in the Spark cluster. The spark directory itself can simply be copied with scp, which saves a lot of time.
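That distribution step can be generated from the slaves file itself. A dry-run sketch (`gen_copy_cmds` is a helper written for this post; it only prints the scp commands, so inspect them first and then pipe the output to sh to execute):

```shell
# Emit one scp command per worker; slaves lines look like "10.0.40.237 bigserver1".
gen_copy_cmds() {
  awk 'NF >= 2 { print "scp -r /bigdata/spark root@" $2 ":/bigdata/" }' "$1"
}

if [ -f /bigdata/spark/conf/slaves ]; then
  gen_copy_cmds /bigdata/spark/conf/slaves
fi
```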

IV. Check the result

(Screenshot: CDH 6 running the vanilla Spark)

# ll /opt/cloudera/parcels/CDH/lib/spark/jars |grep spark-catalyst

lrwxrwxrwx 1 root root 52 9月 26 18:55 spark-catalyst_2.11-2.4.0-cdh6.3.1.jar -> ../../../jars/spark-catalyst_2.11-2.4.0-cdh6.3.1.jar

The Spark jars bundled with CDH carry the -cdh6.3.1 suffix, which distinguishes them from the vanilla build's jars.



Please credit the source when reposting.
Author: 海底苍鹰
Link: http://blog.51yip.com/hadoop/2326.html