Adding spark-sql to Cloudera CDH6

Posted by 张映 on 2019-12-03

Category: hadoop/spark/scala


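spark-sql is a commonly used query tool and is faster than Hive SQL. However, CDH6 does not ship spark-sql.

Before reading this article, read this one first: Using a standalone Apache Spark with CDH 6

1. Unset the environment variables

  # unset KAFKA_HOME FLUME_HOME HBASE_HOME HIVE_HOME SPARK_HOME HADOOP_HOME SQOOP_HOME KYLIN_HOME

If you previously installed a standalone Hadoop ecosystem on this machine, it is best to unset these variables.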
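To confirm that nothing is left over after the unset, a small check like the following can help. It is only a sketch: the function name `list_leftover_homes` is hypothetical, and the variable list simply mirrors the command above.

```shell
# Print every *_HOME variable from the list above that is still set
# in the current shell. The function name is hypothetical.
list_leftover_homes() {
  for name in KAFKA_HOME FLUME_HOME HBASE_HOME HIVE_HOME SPARK_HOME \
              HADOOP_HOME SQOOP_HOME KYLIN_HOME; do
    # ${VAR+x} expands to "x" only when VAR is set (even if it is empty).
    eval "is_set=\${$name+x}"
    if [ -n "$is_set" ]; then
      eval "printf '%s=%s\n' \"$name\" \"\$$name\""
    fi
  done
}
```

Calling `list_leftover_homes` with no arguments prints nothing when all of the variables are gone. Variables exported from a profile script will come back in a new login shell, so the check is worth repeating there.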

2. The problem encountered

Warning: Failed to load org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver: org/apache/hadoop/hive/cli/CliDriver
Failed to load hive class.
You need to build Spark with -Phive and -Phive-thriftserver.
19/12/03 14:18:16 INFO util.ShutdownHookManager: Shutdown hook called
19/12/03 14:18:16 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-81c54c42-0cfd-47f5-ab9e-7853ed23e181

I tried all sorts of approaches, including downloading the source package and recompiling it, and none of them worked.
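The warning above means the Hive thrift server classes are missing from the Spark build in use. One way to check a given Spark directory is to look for the corresponding jar. This is a sketch that assumes the usual `jars/` layout of Apache Spark 2.x binary releases; the function name `has_hive_thriftserver` is hypothetical.

```shell
# Succeed when the Spark distribution under $1 ships the Hive thrift server jar.
# Assumption: binary releases built with -Phive-thriftserver place a jar named
# spark-hive-thriftserver_<scala>-<version>.jar under jars/.
has_hive_thriftserver() {
  ls "$1"/jars/spark-hive-thriftserver_*.jar >/dev/null 2>&1
}
```

For example, `has_hive_thriftserver /bigdata/spark && echo "thrift server jar found"` checks the standalone installation used in the following steps.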

3. Use a separately installed Spark

See: Installing and configuring Spark on YARN

  # cp -r /bigdata/spark /opt/cloudera/parcels/CDH/lib/spark2
  # cd /opt/cloudera/parcels/CDH/lib/spark2
  # rm -rf conf    # remove the original configuration files

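4. Copy the CDH6 Spark configuration into the standalone Spark root directory

  # mkdir /opt/cloudera/parcels/CDH/lib/spark2/conf
  # cp -r /etc/spark/conf/* /opt/cloudera/parcels/CDH/lib/spark2/conf
  # cd /opt/cloudera/parcels/CDH/lib/spark2/conf
  # mv spark-env.sh spark-env    # this step is important

In fact, I do not want spark-sql to run through the existing Spark environment; I only need it to use the Hive metastore. Spark sources conf/spark-env.sh at startup when that file exists, so renaming it keeps the CDH Spark environment settings from being loaded.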

5. Copy hive-site.xml into spark2/conf

  # cp /etc/hive/conf/hive-site.xml ./
  # vim hive-site.xml    # delete the thrift-related settings
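The article does not list which properties were removed. To locate candidates before editing, the lines that mention thrift can be listed first. This is a minimal sketch; the function name `list_thrift_lines` is hypothetical, and it only searches, it does not change the file.

```shell
# Print, with line numbers, every line of the given file that mentions
# "thrift" (case-insensitive), in property names or in values.
list_thrift_lines() {
  grep -n -i 'thrift' "$1"
}
```

Running `list_thrift_lines hive-site.xml` inside spark2/conf shows where to look. Making a copy of the file first (for example `cp -p hive-site.xml hive-site.xml.bak`) keeps the edit reversible.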

6. Set the environment variables

  # export HADOOP_CONF_DIR=/etc/hadoop/conf
  # export YARN_CONF_DIR=/etc/hadoop/conf

If these are not set, the following error is reported.

Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:290)
at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:251)
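Because `export` only affects the current shell and its child processes, this error can still appear when spark-sql is launched from a different session. The following sketch checks the variables in the shell that will actually run spark-sql. The function name `check_yarn_conf` is hypothetical, and the test for yarn-site.xml is an extra sanity check of my own; Spark itself only requires that one of the variables is set.

```shell
# Return 0 when HADOOP_CONF_DIR or YARN_CONF_DIR points at a directory
# that contains yarn-site.xml; otherwise print a hint and return 1.
check_yarn_conf() {
  for dir in "${HADOOP_CONF_DIR:-}" "${YARN_CONF_DIR:-}"; do
    if [ -n "$dir" ] && [ -f "$dir/yarn-site.xml" ]; then
      echo "using YARN config from $dir"
      return 0
    fi
  done
  echo "HADOOP_CONF_DIR / YARN_CONF_DIR unset or missing yarn-site.xml" >&2
  return 1
}
```

Running `check_yarn_conf` right before `spark-sql` shows whether the settings reached that shell.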

7. Set yarn.resourcemanager

Add the machine that runs spark-sql to the resourcemanager.

On the resourcemanager machines, ports 8030 and 8032 will be open.

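We could also edit yarn-site.xml directly, but since this is CDH, modifying the XML by hand is not recommended (self-built clusters excepted). After changing the setting you must restart CDH6; do not skip the restart.

If this is not set, the following error is reported.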

19/12/06 18:27:40 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/12/06 18:27:41 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/12/06 18:27:42 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/12/06 18:27:43 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/12/06 18:27:44 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is
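The address 0.0.0.0:8032 in this log is the YARN default, which suggests the client configuration does not name a resourcemanager. To see what the client configuration actually contains, the relevant properties can be printed from yarn-site.xml. This is a sketch: the function name `show_rm_properties` is hypothetical, and it assumes each `<name>` and `<value>` element sits on a single line, as in typical generated Hadoop XML.

```shell
# Print name=value for every property in the given yarn-site.xml whose
# name starts with yarn.resourcemanager.
show_rm_properties() {
  awk '
    /<name>yarn\.resourcemanager[^<]*<\/name>/ {
      name = $0
      sub(/.*<name>/, "", name); sub(/<\/name>.*/, "", name)
      want = 1
    }
    want && /<value>/ {
      value = $0
      sub(/.*<value>/, "", value); sub(/<\/value>.*/, "", value)
      print name "=" value
      want = 0
    }
  ' "$1"
}
```

For example, `show_rm_properties /etc/hadoop/conf/yarn-site.xml` on the spark-sql machine; empty output would be consistent with the retries against 0.0.0.0:8032.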

8. Create the spark-sql script

  # cat /opt/cloudera/parcels/CDH/bin/spark-sql
  #!/bin/bash
  # Reference: http://stackoverflow.com/questions/59895/can-a-bash-script-tell-what-directory-its-stored-in
  # Follow symlinks until we reach the directory that really holds this script.
  SOURCE="${BASH_SOURCE[0]}"
  BIN_DIR="$( dirname "$SOURCE" )"
  while [ -h "$SOURCE" ]
  do
    SOURCE="$(readlink "$SOURCE")"
    [[ $SOURCE != /* ]] && SOURCE="$BIN_DIR/$SOURCE"
    BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
  done
  BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
  LIB_DIR=$BIN_DIR/../lib
  export HADOOP_HOME=$LIB_DIR/hadoop

  # Autodetect JAVA_HOME if not defined (sourced so that JAVA_HOME reaches this shell)
  . "$LIB_DIR/bigtop-utils/bigtop-detect-javahome"

  # Start the Spark SQL CLI through the standalone Spark copied in step 3.
  exec "$LIB_DIR/spark2/bin/spark-submit" --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"
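The loop at the top of the wrapper matters because the script is normally reached through symlinks (the alternatives entry in the next step, and the parcel directory itself). The same logic, pulled out into a standalone function, can be tried on any path. This is only a sketch in portable sh rather than the exact bash of the wrapper; the function name `resolve_script_dir` is hypothetical.

```shell
# Print the physical directory that contains the file behind $1,
# following any chain of symlinks. Runs in a subshell, so the caller's
# working directory is left unchanged.
resolve_script_dir() (
  source_path=$1
  dir=$(dirname "$source_path")
  while [ -h "$source_path" ]; do
    source_path=$(readlink "$source_path")
    # A relative link target is relative to the directory holding the link.
    case $source_path in
      /*) ;;
      *) source_path=$dir/$source_path ;;
    esac
    dir=$(cd -P "$(dirname "$source_path")" && pwd)
  done
  cd -P "$(dirname "$source_path")" && pwd
)
```

For example, `resolve_script_dir /usr/bin/spark-sql` should print the real bin directory of the parcel once the alternatives entry exists, which is the directory the wrapper uses to find `../lib`.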

9. Register it on the executable path

  # alternatives --install /usr/bin/spark-sql spark-sql /opt/cloudera/parcels/CDH/bin/spark-sql 1

spark-sql can now be used. This way of installing it has no destructive effect on CDH6.



Please credit the source when reposting
Author: 海底苍鹰
URL: http://blog.51yip.com/hadoop/2286.html

2 comments

  1. zhc wrote:

    After I set yarn.resourcemanager as in step 7, why does startup fail?
    Also, does Spark need to be compiled?
    Failed to load main class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
    You need to build Spark with -Phive and -Phive-thriftserver.

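  2. feego wrote:

    Hi, I followed your steps and still get: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
    Is this step meant to make the new Spark use the CDH configuration?
    mv spark-env.sh spark-env //this step is important
    Both of the environment variables above are already set. Could there be another cause?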