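spark-sql is a commonly used query tool, and it is faster than Hive SQL. However, CDH 6 does not ship with spark-sql.
Before reading this article, first read: "Using a standalone Apache Spark with CDH 6"
1, Unset environment variables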
- # unset KAFKA_HOME FLUME_HOME HBASE_HOME HIVE_HOME SPARK_HOME HADOOP_HOME SQOOP_HOME KYLIN_HOME
If you previously installed a standalone Hadoop ecosystem, it is best to unset these variables.
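Before unsetting anything, it can help to see which of these variables are actually exported in the current shell. A minimal sketch: the function name `list_set_homes` is made up for illustration, and the variable list is simply the one from the command above.

```shell
# Print NAME=value for each listed variable that is exported in this shell.
list_set_homes() {
  for name in KAFKA_HOME FLUME_HOME HBASE_HOME HIVE_HOME SPARK_HOME HADOOP_HOME SQOOP_HOME KYLIN_HOME; do
    # printenv exits non-zero when the variable is not in the environment
    if value=$(printenv "$name"); then
      printf '%s=%s\n' "$name" "$value"
    fi
  done
}
```

Run `list_set_homes` before and after the `unset`; the second run should print nothing.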
2, The problem encountered
Warning: Failed to load org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver: org/apache/hadoop/hive/cli/CliDriver
Failed to load hive class.
You need to build Spark with -Phive and -Phive-thriftserver.
19/12/03 14:18:16 INFO util.ShutdownHookManager: Shutdown hook called
19/12/03 14:18:16 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-81c54c42-0cfd-47f5-ab9e-7853ed23e181
I tried all sorts of approaches, including downloading the source package and recompiling, and none of them worked.
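A quick way to tell whether a given Spark directory can run the SQL CLI at all is to look for the Hive thriftserver jar. This is only a sketch: it assumes the usual Spark 2.x layout where jars live under `jars/`, and the function name `has_thriftserver_jar` is made up for illustration.

```shell
# Return 0 if the given Spark directory has a spark-hive-thriftserver jar in jars/.
has_thriftserver_jar() {
  spark_dir=$1
  for jar in "$spark_dir"/jars/spark-hive-thriftserver_*.jar; do
    # If the glob matched nothing, $jar is the literal pattern and -e fails
    if [ -e "$jar" ]; then
      echo "found: $jar"
      return 0
    fi
  done
  echo "no spark-hive-thriftserver jar under $spark_dir/jars"
  return 1
}
```

For example, `has_thriftserver_jar /bigdata/spark` checks the standalone Spark that is copied in the next step.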
3, Use the independently installed Spark
- # cp -r /bigdata/spark /opt/cloudera/parcels/CDH/lib/spark2
- # cd /opt/cloudera/parcels/CDH/lib/spark2
- # rm -rf conf //delete the original configuration files
4, Copy the CDH 6 Spark configuration into the standalone Spark's root directory
- # mkdir /opt/cloudera/parcels/CDH/lib/spark2/conf
- # cp -r /etc/spark/conf/* /opt/cloudera/parcels/CDH/lib/spark2/conf
- # cd /opt/cloudera/parcels/CDH/lib/spark2/conf
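- # mv spark-env.sh spark-env //this step is important
I don't actually want spark-sql to run with the existing CDH Spark environment; I only need spark-sql to use the Hive metastore. Renaming spark-env.sh keeps Spark's launch scripts from sourcing it.
5, Copy hive-site.xml into spark2/conf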
- # cp /etc/hive/conf/hive-site.xml ./
- # vim hive-site.xml //delete the thrift-related settings
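To find candidates before editing, you can list every line of the file that mentions thrift. This is a sketch only: the function name `list_thrift_lines` is made up, and it just shows matches for review. Which entries to delete is still a manual decision, since the article does not name the exact properties.

```shell
# Print each line of the given file that mentions "thrift", with its line number.
list_thrift_lines() {
  # -n prefixes the line number, -i ignores case
  grep -n -i 'thrift' "$1"
}
```

For example, `list_thrift_lines ./hive-site.xml` from inside the spark2/conf directory.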
6, Set environment variables
- # export HADOOP_CONF_DIR=/etc/hadoop/conf
- # export YARN_CONF_DIR=/etc/hadoop/conf
If these are not set, the following error is reported.
Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:290)
at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:251)
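Note that `export` only affects the current shell and its children, so the variables have to be exported in the same session that later runs spark-sql. A small sketch for checking this before launching; the function name `check_yarn_conf` is made up for illustration, and it only checks that one of the two variables points at a directory containing yarn-site.xml.

```shell
# Return 0 if HADOOP_CONF_DIR or YARN_CONF_DIR points at a directory with yarn-site.xml.
check_yarn_conf() {
  for dir in "${HADOOP_CONF_DIR:-}" "${YARN_CONF_DIR:-}"; do
    # Skip variables that are unset or empty
    if [ -n "$dir" ] && [ -f "$dir/yarn-site.xml" ]; then
      echo "ok: $dir/yarn-site.xml"
      return 0
    fi
  done
  echo "neither HADOOP_CONF_DIR nor YARN_CONF_DIR points at a directory with yarn-site.xml"
  return 1
}
```

If this reports a problem in the shell where spark-sql is started, the exports from this step did not reach that session.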
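7, Set yarn.resourcemanager
We could also edit yarn-site.xml directly, but since this is CDH, hand-editing the XML is not recommended (self-built clusters excepted). After changing the setting, be sure to restart CDH 6.
If it is not set, the following error is reported: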
19/12/06 18:27:40 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/12/06 18:27:41 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/12/06 18:27:42 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/12/06 18:27:43 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/12/06 18:27:44 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is
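The address 0.0.0.0:8032 in the log is what YARN falls back to by default when the client configuration does not name a ResourceManager. To see what the client configuration currently contains, you can list the yarn.resourcemanager entries in yarn-site.xml. A sketch only; the function name `list_rm_properties` is made up for illustration.

```shell
# Print the <name> entries in a yarn-site.xml that start with yarn.resourcemanager.
list_rm_properties() {
  # -o prints only the matched part of each line
  grep -o '<name>yarn\.resourcemanager\.[^<]*</name>' "$1"
}
```

For example, `list_rm_properties /etc/hadoop/conf/yarn-site.xml`. No output means the file has no such entries.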
8, Create the spark-sql script
- # cat /opt/cloudera/parcels/CDH/bin/spark-sql
- #!/bin/bash
- # Reference: http://stackoverflow.com/questions/59895/can-a-bash-script-tell-what-directory-its-stored-in
- SOURCE="${BASH_SOURCE[0]}"
- BIN_DIR="$( dirname "$SOURCE" )"
- while [ -h "$SOURCE" ]
- do
- SOURCE="$(readlink "$SOURCE")"
- [[ $SOURCE != /* ]] && SOURCE="$BIN_DIR/$SOURCE"
- BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
- done
- BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
- LIB_DIR=$BIN_DIR/../lib
- export HADOOP_HOME=$LIB_DIR/hadoop
- # Autodetect JAVA_HOME if not defined
- . $LIB_DIR/bigtop-utils/bigtop-detect-javahome
- exec $LIB_DIR/spark2/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"
9, Add it to the executable path
- # alternatives --install /usr/bin/spark-sql spark-sql /opt/cloudera/parcels/CDH/bin/spark-sql 1
spark-sql is now ready to use. Installing it this way does not have any destructive effect on CDH 6.
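A quick smoke test is to run one trivial query and exit. This is a sketch: the function name `spark_sql_smoke_test` is made up, and the optional argument that overrides the command exists only so the function can be tried without a cluster.

```shell
# Run a trivial query through spark-sql to confirm the wrapper and metastore work.
# $1 optionally overrides the command to run (defaults to spark-sql).
spark_sql_smoke_test() {
  cmd=${1:-spark-sql}
  if ! command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd not found on PATH"
    return 1
  fi
  # -e runs the given SQL string and then exits
  "$cmd" -e 'show databases;'
}
```

If the Hive metastore is reachable, the databases listed should match what Hive itself shows.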
Please credit the source when reposting
Author: 海底苍鹰
URL: http://blog.51yip.com/hadoop/2286.html
Why does startup fail after I configure step 7 (setting yarn.resourcemanager)?
Also, does Spark need to be compiled?
Failed to load main class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
You need to build Spark with -Phive and -Phive-thriftserver.
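Brother, I followed your steps and still get: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
Is this step meant to make the new Spark use CDH's configuration?
mv spark-env.sh spark-env //this step is important
Both of the environment variables above are already set. Could there be another cause?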