Scala Spark SQL local debugging

Posted by 张映 on 2020-01-02

Category: hadoop/spark/scala


For developers who are used to SQL, writing SQL is far more natural than expressing the same logic with operators such as map and filter.

I. sbt project

1. build.sbt configuration

name := "scalatest"

version := "0.1"

scalaVersion := "2.11.8"

libraryDependencies += "com.alibaba" % "fastjson" % "1.2.49"

libraryDependencies ++= Seq(
    "org.apache.spark" % "spark-core_2.11" % "2.3.0",
    "org.apache.spark" % "spark-hive_2.11" % "2.3.0",
    "org.apache.spark" % "spark-sql_2.11" % "2.3.0"
)

Pick the versions of spark-core, spark-hive, and spark-sql to match your own environment, and make sure the _2.11 suffix matches scalaVersion.
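Since all three artifacts have to share the same Spark version and the same Scala suffix, a common sbt idiom is to factor both out; a minimal equivalent sketch (the %% operator appends the scalaVersion suffix automatically):

val sparkVersion = "2.3.0"

libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion,
    "org.apache.spark" %% "spark-hive" % sparkVersion,
    "org.apache.spark" %% "spark-sql" % sparkVersion
)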

2. Test code

package ex

import org.apache.spark.sql.SparkSession

object tank {

    var data = ""

    def main(args: Array[String]): Unit = {

        // local session; enableHiveSupport() is what allows the Hive DDL below to run
        val spark = SparkSession.builder()
                .master("local")
//              .config("spark.sql.hive.thriftServer.singleSession", true)
                .enableHiveSupport()
                .appName("tanktest")
                .getOrCreate()

        import spark.implicits._ // not needed for plain SQL, but handy for Dataset conversions

        // Hive DDL for a pipe-delimited text table. A triple-quoted string keeps
        // '\n' as the two literal characters the DDL expects, with no Scala escaping.
        val tanktest: String =
            """CREATE TABLE `tank_test` (
              |  `creative_id` string,
              |  `category_name` string,
              |  `ad_keywords` string,
              |  `creative_type` string,
              |  `inventory_type` string,
              |  `gender` string,
              |  `source` string,
              |  `advanced_creative_title` string,
              |  `first_industry_name` string,
              |  `second_industry_name` string)
              |ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'
              |STORED AS TEXTFILE""".stripMargin

        // parse command-line arguments of the form --data <dir>
        for (i <- args.indices by 2) {
            args(i) match {
                case "--data" => data = args(i + 1)
                case _        => // ignore unrecognized arguments
            }
        }

        spark.sql(tanktest)
        spark.sql(s"LOAD DATA LOCAL INPATH '$data/creat_partd' INTO TABLE tank_test")
        spark.sql("select count(*) as total from tank_test").show()

        spark.stop()
    }
}
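The table declares ten string columns separated by '|', so the file the code loads ($data/creat_partd, as named in the source above) should hold one record per line with nine pipe separators. A sample row, with made-up values purely for illustration:

1001|auto|suv,sedan|image|feed|male|dsp|Summer Sale|automotive|car dealers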

If you hit this error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);

There are two ways to fix it:

// of the following two lines, pick one
//  .config("spark.sql.hive.thriftServer.singleSession", true)
    .enableHiveSupport()

3. IDEA debug configuration

[Screenshot: spark-sql local debug configuration in IDEA]
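The screenshots showed a plain IntelliJ IDEA Application run configuration. Assuming this project's names, the relevant fields would look roughly like this (the data path is hypothetical; use any local directory that contains creat_partd):

Main class:        ex.tank
Program arguments: --data /home/tank/data
Use classpath of module: scalatest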

4. Debug results

[Screenshot: spark-sql local debug run]

[Screenshot: spark-sql local run results]

Note: this local debugging run does not connect to a remote Hive, and hive.metastore.warehouse.dir is not set, so the metastore files and the table data all end up under the current project directory (Spark creates metastore_db and spark-warehouse there by default).
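If you would rather pin the data directory explicitly, Spark exposes spark.sql.warehouse.dir for the table data location; a minimal sketch (the /tmp path is an assumption, any writable local directory works):

val spark = SparkSession.builder()
        .master("local")
        .appName("tanktest")
        // hypothetical location; defaults to ./spark-warehouse in the working directory
        .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
        .enableHiveSupport()
        .getOrCreate()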

II. Maven project

1. Add the following to pom.xml

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.3.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
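Note that Maven does not compile Scala sources on its own, so the pom also needs a Scala compiler plugin. A minimal sketch using the widely used scala-maven-plugin (the version number is an assumption; check Maven Central for a current one):

<build>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.4.4</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>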

Everything else is the same as in the sbt setup. With this kind of local development the SQL in your code runs, but there is no real data; data has to be copied down from production. The next post will cover how to connect a local spark-sql session to the online Hive.



Please credit the source when reposting.
Author: 海底苍鹰
URL: http://blog.51yip.com/hadoop/2333.html