Adding spark-sql to Cloudera CDH6

Posted by 张映 on 2019-12-03

Category: hadoop/spark/scala

spark-sql is a commonly used query tool, and it is faster than Hive SQL. CDH6, however, does not include spark-sql.

1, Unset the environment variables

# unset KAFKA_HOME FLUME_HOME HBASE_HOME HIVE_HOME SPARK_HOME HADOOP_HOME SQOOP_HOME KYLIN_HOME

If you previously installed a standalone Hadoop ecosystem, it is best to unset these variables.

2, Problems encountered
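
To confirm that nothing is left over, the check below can be run in the same shell. It is a minimal sketch: the helper name `list_standalone_homes` is made up for illustration, and it only sees exported variables.

```shell
#!/bin/bash
# Hypothetical helper: print any of the standalone-install variables
# that are still exported in the current environment.
list_standalone_homes() {
  env | grep -E '^(KAFKA|FLUME|HBASE|HIVE|SPARK|HADOOP|SQOOP|KYLIN)_HOME=' || true
}

list_standalone_homes   # what is set before the cleanup

unset KAFKA_HOME FLUME_HOME HBASE_HOME HIVE_HOME SPARK_HOME HADOOP_HOME SQOOP_HOME KYLIN_HOME

if [ -z "$(list_standalone_homes)" ]; then
  echo "no standalone *_HOME variables left in this shell"
fi
```

`unset` only affects the current shell. If the variables are exported from a profile file, they will come back at the next login until those lines are removed.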

Warning: Failed to load org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver: org/apache/hadoop/hive/cli/CliDriver
Failed to load hive class.
You need to build Spark with -Phive and -Phive-thriftserver.
19/12/03 14:18:16 INFO util.ShutdownHookManager: Shutdown hook called
19/12/03 14:18:16 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-81c54c42-0cfd-47f5-ab9e-7853ed23e181

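I tried various approaches, including downloading the source package and recompiling, and none of them worked.

3, Use a standalone Spark installation

See: spark on yarn installation and configuration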
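
One quick way to see whether a given Spark directory can run the SQL CLI at all is to look for the relevant jars. The sketch below assumes the jar naming used by stock Apache Spark binary distributions (`spark-hive-thriftserver_*.jar` and `hive-cli-*.jar` under `jars/`). The helper names are made up, and `/opt/cloudera/parcels/CDH/lib/spark` is an assumed location for the Spark bundled with CDH.

```shell
#!/bin/bash
# Hypothetical helper: succeed if $1/jars contains a file matching pattern $2.
# $2 is deliberately left unquoted so the shell expands the glob.
has_jar() {
  ls "$1"/jars/$2 >/dev/null 2>&1
}

# Report, for one Spark home, which of the two jars are present.
report_hive_cli_jars() {
  local pattern
  for pattern in 'spark-hive-thriftserver_*.jar' 'hive-cli-*.jar'; do
    if has_jar "$1" "$pattern"; then
      echo "$1: found $pattern"
    else
      echo "$1: missing $pattern"
    fi
  done
}

report_hive_cli_jars /opt/cloudera/parcels/CDH/lib/spark   # assumed path of the CDH-bundled Spark
report_hive_cli_jars /bigdata/spark                        # the standalone Spark used in this post
```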

# cp -r /bigdata/spark /opt/cloudera/parcels/CDH/lib/spark2
# cd /opt/cloudera/parcels/CDH/lib/spark2
# rm -rf conf  // delete the original configuration files

4, Copy the CDH6 Spark configuration into the standalone Spark root directory

# mkdir /opt/cloudera/parcels/CDH/lib/spark2/conf
# cp -r /etc/spark/conf/* /opt/cloudera/parcels/CDH/lib/spark2/conf
# cd /opt/cloudera/parcels/CDH/lib/spark2/conf
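# mv spark-env.sh spark-env  // this step is important

I don't actually want spark-sql to pick up the existing CDH Spark environment; I only need it to use the Hive metastore. Renaming spark-env.sh keeps Spark from sourcing it at startup.

5, Copy hive-site.xml to spark2/conf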

# cp /etc/hive/conf/hive-site.xml ./
# vim hive-site.xml  // remove the thrift-related settings

6, Set environment variables
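
Before editing, it can help to list the lines that mention thrift so you know what to look for in vim. This is only a sketch with a made-up helper name; it prints matching lines and leaves the decision about which properties to delete to you.

```shell
#!/bin/bash
# Hypothetical helper: print line numbers and text of every line in file $1
# that mentions "thrift" (case-insensitive). Prints nothing if there are none.
thrift_lines() {
  grep -n -i 'thrift' "$1" || true
}

# Run from the spark2/conf directory, where hive-site.xml was just copied.
if [ -f ./hive-site.xml ]; then
  thrift_lines ./hive-site.xml
fi
```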

# export HADOOP_CONF_DIR=/etc/hadoop/conf
# export YARN_CONF_DIR=/etc/hadoop/conf

Without these settings, the following error is reported.

Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:290)
at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:251)
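
A small guard like the one below can catch this before spark-sql is launched. It is a sketch under assumptions: `check_yarn_conf` is a made-up name, and the extra test for `yarn-site.xml` is my own addition rather than something Spark itself checks.

```shell
#!/bin/bash
# Hypothetical guard: fail early if neither variable is set, or if the
# directory it points to has no yarn-site.xml.
check_yarn_conf() {
  local dir="${HADOOP_CONF_DIR:-${YARN_CONF_DIR:-}}"
  if [ -z "$dir" ]; then
    echo "neither HADOOP_CONF_DIR nor YARN_CONF_DIR is set" >&2
    return 1
  fi
  if [ ! -f "$dir/yarn-site.xml" ]; then
    echo "no yarn-site.xml in $dir" >&2
    return 1
  fi
  echo "using YARN config from $dir"
}

check_yarn_conf || echo "export HADOOP_CONF_DIR=/etc/hadoop/conf before running spark-sql"
```

To make the settings survive a new login, the two `export` lines can go into a shell profile.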

7, Configure yarn.resourcemanager

Add the machine running spark-sql to the resourcemanager.

On the resourcemanager machine, ports 8030 and 8032 will then be open.

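We could also edit yarn-site.xml directly, but with CDH, hand-editing the XML is not recommended (self-built clusters are the exception). After changing the setting, be sure to restart CDH6.

Without this setting, the following error is reported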

19/12/06 18:27:40 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/12/06 18:27:41 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/12/06 18:27:42 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/12/06 18:27:43 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/12/06 18:27:44 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is
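
To see which ResourceManager address the client configuration actually carries, you can read it out of yarn-site.xml. The sketch below is grep/sed based, not a real XML parser: it assumes the `<value>` sits on the same line as the `<name>` or on the line right after it, and `rm_address` is a made-up helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the value of yarn.resourcemanager.address from
# the yarn-site.xml given as $1, or nothing if the property is absent.
rm_address() {
  grep -F -A1 '<name>yarn.resourcemanager.address</name>' "$1" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p' \
    | head -n 1
}

conf="${HADOOP_CONF_DIR:-/etc/hadoop/conf}/yarn-site.xml"
if [ -f "$conf" ]; then
  addr="$(rm_address "$conf")"
  echo "yarn.resourcemanager.address: ${addr:-not set}"
fi
```

If the property is missing and no ResourceManager hostname is configured either, the Hadoop client falls back to its default of 0.0.0.0:8032, which matches the retry messages above.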

8, Create the spark-sql script

# cat /opt/cloudera/parcels/CDH/bin/spark-sql
#!/bin/bash
# Reference: http://stackoverflow.com/questions/59895/can-a-bash-script-tell-what-directory-its-stored-in
SOURCE="${BASH_SOURCE[0]}"
BIN_DIR="$( dirname "$SOURCE" )"
while [ -h "$SOURCE" ]
do
  SOURCE="$(readlink "$SOURCE")"
  [[ $SOURCE != /* ]] && SOURCE="$BIN_DIR/$SOURCE"
  BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
done
BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
LIB_DIR=$BIN_DIR/../lib
export HADOOP_HOME=$LIB_DIR/hadoop

# Autodetect JAVA_HOME if not defined
. $LIB_DIR/bigtop-utils/bigtop-detect-javahome

exec $LIB_DIR/spark2/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"

9, Add it to the executable path

# alternatives --install /usr/bin/spark-sql spark-sql /opt/cloudera/parcels/CDH/bin/spark-sql 1

spark-sql is now ready to use. Installing it this way has no destructive impact on CDH6.
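
A quick smoke test, assuming the alternatives entry above put `spark-sql` on the PATH, is to run a single statement with the CLI's `-e` option. The function name `smoke_test` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical smoke test: run one statement through spark-sql if it is on PATH.
smoke_test() {
  if ! command -v spark-sql >/dev/null 2>&1; then
    echo "spark-sql is not on PATH" >&2
    return 1
  fi
  spark-sql -e "show databases;"
}

smoke_test || echo "smoke test failed"
```

If the Hive metastore is wired up correctly, the databases listed should be the same ones Hive shows.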



Please credit the source when reposting
Author: 海底苍鹰
URL: http://blog.51yip.com/hadoop/2286.html
