When table data is stored on HDFS as text files, loading it is straightforward whether the table is managed (internal) or external: just copy the files into the table's HDFS directory with an hdfs command, keeping an eye on the directory's access permissions.
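For example, a minimal sketch for a text-backed table (the file name and warehouse path here are assumptions; substitute your own):

# hdfs dfs -put /tmp/demo.txt /user/hive/warehouse/demo.db/demo/
# hdfs dfs -ls /user/hive/warehouse/demo.db/demo/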
Parquet files can be handled the same way.
I. Install the parquet-tools utility
1. Download apache-parquet
# wget https://github.com/apache/parquet-mr/archive/apache-parquet-1.10.0.tar.gz
# tar zxvf apache-parquet-1.10.0.tar.gz
# cd parquet-mr-apache-parquet-1.10.0/parquet-tools && mvn clean package -Plocal

The -Plocal profile is added to the mvn build so that the resulting jar can be run locally with plain java. Without it, you have to go through the hadoop command to inspect files.
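In other words, the two invocation styles look like this (paths assumed here; both reappear in the examples below):

// built with -Plocal: run directly with java
# java -jar parquet-tools/target/parquet-tools-1.10.0.jar schema tank.parquet
// built without -Plocal: run through hadoop
# hadoop jar parquet-tools/target/parquet-tools-1.10.0.jar schema hdfs://bigdata1/user/test.parquet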
Do not build straight from the master branch of https://github.com/apache/parquet-mr.git; compiling it fails with errors like the following.
[ERROR] Failed to execute goal org.apache.thrift:thrift-maven-plugin:0.10.0:compile (thrift-sources) on project parquet-format-structures: thrift did not exit cleanly. Review output for more information. -> [Help 1]
[ERROR] Failed to execute goal on project parquet-tools: Could not resolve dependencies for project org.apache.parquet:parquet-tools:jar:1.12.0-SNAPSHOT: The following artifacts could not be resolved: org.apache.parquet:parquet-format-structures:jar:1.12.0-SNAPSHOT, org.apache.parquet:parquet-hadoop:jar:1.12.0-SNAPSHOT: Could not find artifact org.apache.parquet:parquet-format-structures:jar:1.12.0-SNAPSHOT in aliyun-public (https://maven.aliyun.com/repository/public)
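If you prefer git over the release tarball, checking out the release tag (the same tag the tarball above is cut from) should avoid these SNAPSHOT dependency errors:

# git clone https://github.com/apache/parquet-mr.git
# cd parquet-mr
# git checkout apache-parquet-1.10.0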
2. Install via pip
# pip install parquet
II. Verify that a parquet file is well-formed
1. Inspect a local file with the hand-built parquet-tools jar
$ alias pq_data
alias pq_data='java -jar /path/to/parquet-mr/parquet-tools/target/parquet-tools-1.10.0.jar head -n 2 '
$ pq_data tank.parquet    // view the data in the parquet columns
creative_id = 1642531926139931
category_name = 国内线路游
ad_keywords = 爸爸;妈妈;老师;班主任;优良;培养;搞笑;经典;游戏;孩子;学习;教育;书籍;青少年;初中;小学;家长;家庭;学校;挂图;
creative_type = 竖版视频
inventory_type = INVENTORY_FEED;INVENTORY_AWEME_FEED;INVENTORY_HOTSOON_FEED;INVENTORY_VIDEO_FEED;
gender = NONE
source = 硕硕图书
first_industry_name = 传媒及内容
second_industry_name = 书籍杂志

creative_id = 1642465004243004
category_name = 住房中介(公司)
ad_keywords = 58同城;赶集;我爱我家;公寓出租;公寓;整租;单间;合租;租房;租房子;上海租房;上海租房网;租房网;上海合租;房屋出租;出租;公租房;租房攻略;廉租房;毕业租房;
creative_type = 竖版视频
inventory_type = INVENTORY_UNION_SLOT;INVENTORY_FEED;INVENTORY_AWEME_FEED;INVENTORY_HOTSOON_FEED;INVENTORY_VIDEO_FEED;
gender = NONE
source = 上海青客租房
first_industry_name = 房地产
second_industry_name = 房地产中介

$ alias pq_schema
alias pq_schema='java -jar /path/to/parquet-mr/parquet-tools/target/parquet-tools-1.10.0.jar schema '
$ pq_schema tank.parquet    // view the parquet schema
message spark_schema {
  optional int64 creative_id;
  optional binary category_name (UTF8);
  optional binary ad_keywords (UTF8);
  optional binary creative_type (UTF8);
  optional binary inventory_type (UTF8);
  optional binary gender (UTF8);
  optional binary source (UTF8);
  optional binary advanced_creative_title (UTF8);
  optional binary first_industry_name (UTF8);
  optional binary second_industry_name (UTF8);
}
2. Inspect a file on HDFS with the hand-built parquet-tools jar
# hadoop jar parquet-tools-1.11.0.jar schema -d hdfs://bigdata1/user/test.parquet |head -n 30
WARNING: Use "yarn jar" to launch YARN applications.
message spark_schema {
  optional binary _c0 (STRING);
  optional binary _c1 (STRING);
  optional binary _c2 (STRING);
  optional binary _c3 (STRING);
  optional binary _c4 (STRING);
  optional binary _c5 (STRING);
  optional binary _c6 (STRING);
  optional binary _c7 (STRING);
  optional binary _c8 (STRING);
  optional binary _c9 (STRING);
}

# hadoop jar parquet-tools-1.11.0.jar head -n 2 hdfs://bigdata1/user/test.parquet
WARNING: Use "yarn jar" to launch YARN applications.
19/12/12 16:58:29 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
19/12/12 16:58:29 INFO hadoop.ParquetFileReader: reading another 1 footers
19/12/12 16:58:29 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
19/12/12 16:58:30 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 18715 records.
19/12/12 16:58:30 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
19/12/12 16:58:30 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
19/12/12 16:58:30 INFO hadoop.InternalParquetRecordReader: block read in memory in 67 ms. row count = 18715
_c0 = 1642643001630776
_c1 = 眼镜
_c2 = 眼镜;变色镜;夜视镜;太阳镜;男人;女人;驾驶;偏光镜;小姐姐;司机;太阳;夏天;日夜两用;开车;老司机;汽车;有车一族;夏季;清晰;眼睛;
_c3 = 竖版视频
_c4 = INVENTORY_AWEME_FEED;
_c5 = NONE
_c6 = 闻书-免费试用
_c8 = 零售
_c9 = 综合类2C电商

_c0 = 1642637175138360
_c1 = 婚恋服务
_c2 = 陌陌;相亲;交友;恋爱;网友;网恋;婚恋;离异;离婚;再婚;结婚;对象;约会;约炮;同城;直播;单身;脱单;少妇;美女;
_c3 = 大图竖图
_c4 = INVENTORY_HOTSOON_FEED;
_c5 = GENDER_MALE
_c7 = 私密相亲,1对1互动
_c8 = 生活服务
_c9 = 婚恋服务
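As the WARNING suggests, on Hadoop 3 you can substitute yarn jar for hadoop jar; the equivalent call would be:

# yarn jar parquet-tools-1.11.0.jar schema -d hdfs://bigdata1/user/test.parquet |head -n 30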
3. Use the pip-installed parquet command
$ parquet --help
usage: parquet [-h] [--metadata] [--row-group-metadata] [--no-data]
               [--limit LIMIT] [--col COL] [--no-headers] [--format FORMAT]
               [--debug]
               file

Read parquet files

positional arguments:
  file                  path to the file to parse

optional arguments:
  -h, --help            show this help message and exit
  --metadata            show metadata on file
  --row-group-metadata  show per row group metadata
  --no-data             don't dump any data from the file
  --limit LIMIT         max records to output
  --col COL             only include this column (can be specified multiple times)
  --no-headers          skip headers in output (only applies if format=csv)
  --format FORMAT       format for the output data. can be csv or json.
  --debug               log debug info to stderr

$ parquet tank.parquet --format json > test.json    // convert the data to JSON
$ parquet tank.parquet --limit 2                    // show the first 2 records
$ parquet --metadata tank.parquet |less             // view the file metadata
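The options combine as you would expect; for instance, to dump only two columns of the first two records as CSV (column names taken from the local tank.parquet above):

$ parquet tank.parquet --limit 2 --col creative_id --col category_name --format csv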
III. Create the table and load the parquet file
1. Create the table
create table `tank` (
  `creative_id` BIGINT,
  `category_name` string,
  `ad_keywords` string,
  `creative_type` string,
  `inventory_type` string,
  `gender` string,
  `source` string,
  `advanced_creative_title` string,
  `first_industry_name` string,
  `second_industry_name` string
)
STORED AS PARQUET
TBLPROPERTIES('parquet.compression'='SNAPPY');
The table's column types must line up with the types in the parquet schema; otherwise queries fail with errors like the following.
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.LongWritable
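A safe way to get the DDL right is to derive each column type from the pq_schema output above; for the types in this file the mapping is:

optional int64 creative_id            ->  `creative_id` BIGINT
optional binary category_name (UTF8)  ->  `category_name` string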
The same goes for the compression codec; the default is gzip.
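To check which codec a file actually uses, the meta subcommand of parquet-tools prints the codec for each column chunk (jar path assumed as above):

# java -jar parquet-tools/target/parquet-tools-1.10.0.jar meta tank.parquet | grep -i snappy | head -n 3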
2. Load the data
# mkdir /home/hdfs
# cp tank.parquet /home/hdfs
# chown hdfs.hdfs -R /home/hdfs

// Method 1
hive> load data local inpath '/home/hdfs/tank.parquet' OVERWRITE into table tanktest.tank;

// Method 2
# sudo -u hdfs hdfs dfs -put /home/hdfs/tank.parquet /home/cdh6/hive/warehouse/tanktest.db/tank/
Both methods get the file into the table's directory. This cluster runs CDH 6 with Hadoop 3.0, and the HDFS files are not owned by root, which is why the local file ownership is changed with chown before loading.
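If a file does land in the warehouse directory with the wrong owner (say, after a put as root), a sketch of fixing it, assuming hdfs:hdfs is the correct owner on this cluster:

# sudo -u hdfs hdfs dfs -chown -R hdfs:hdfs /home/cdh6/hive/warehouse/tanktest.db/tank
# sudo -u hdfs hdfs dfs -ls /home/cdh6/hive/warehouse/tanktest.db/tank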
hive> desc formatted tank;
OK
# col_name              data_type               comment
creative_id             bigint
category_name           string
ad_keywords             string
creative_type           string
inventory_type          string
gender                  string
source                  string
advanced_creative_title string
first_industry_name     string
second_industry_name    string

# Detailed Table Information
Database:               tanktest
OwnerType:              USER
Owner:                  hdfs
CreateTime:             Fri Dec 13 09:49:47 CST 2019
LastAccessTime:         UNKNOWN
Retention:              0
Location:               hdfs://bigdata1/home/cdh6/hive/warehouse/tanktest.db/tank
Table Type:             MANAGED_TABLE
Table Parameters:
        numFiles                1
        numRows                 0
        parquet.compression     SNAPPY
        rawDataSize             0
        totalSize               660925
        transient_lastDdlTime   1576208724

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.171 seconds, Fetched: 40 row(s)

hive> select creative_id,category_name from tank limit 2;
OK
1642531926139931        国内线路游
1642465004243004        住房中介(公司)
Time taken: 0.271 seconds, Fetched: 2 row(s)
Please credit the source when reposting.
Author: 海底苍鹰
Link: http://blog.51yip.com/hadoop/2322.html