Importing a parquet file into a table

Posted by 张映 on 2019-12-16

Category: hadoop/spark/scala


When table data sits on HDFS as plain text files, loading it is easy whether the table is managed (internal) or external: just copy the files into the table's directory with the hdfs command, minding the access permissions on the HDFS directory.

Parquet files can be handled the same way.
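
In both cases the load boils down to a copy like this (a minimal sketch; the database, table, and file names below are made up for illustration):

// Sketch only: mydb, mytable, and data.txt are hypothetical names
# hdfs dfs -put /tmp/data.txt /user/hive/warehouse/mydb.db/mytable/
// Make sure the file is owned by the user Hive runs as
# hdfs dfs -chown hive:hive /user/hive/warehouse/mydb.db/mytable/data.txt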

Part 1: Install the parquet-tools utility

1. Download apache-parquet

# wget https://github.com/apache/parquet-mr/archive/apache-parquet-1.10.0.tar.gz
# tar zxvf  apache-parquet-1.10.0.tar.gz
# cd parquet-mr-apache-parquet-1.10.0/parquet-tools && mvn clean package -Plocal

The -Plocal profile is added to the mvn build so that the resulting jar can be run locally with plain java. Without it, you would have to use the hadoop command to inspect files.
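
The difference looks like this (a sketch; the jar path and the parquet file names here are placeholders):

// Built with -Plocal: plain java works against a local file
$ java -jar parquet-tools/target/parquet-tools-1.10.0.jar head -n 2 local.parquet
// Built without -Plocal: run it through hadoop, which supplies the dependencies
# hadoop jar parquet-tools-1.10.0.jar head -n 2 hdfs://namenode/path/file.parquet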

Do not download straight from https://github.com/apache/parquet-mr.git (the master branch); building that snapshot fails with the following errors.

[ERROR] Failed to execute goal org.apache.thrift:thrift-maven-plugin:0.10.0:compile (thrift-sources) on project parquet-format-structures: thrift did not exit cleanly. Review output for more information. -> [Help 1]

[ERROR] Failed to execute goal on project parquet-tools: Could not resolve dependencies for project org.apache.parquet:parquet-tools:jar:1.12.0-SNAPSHOT: The following artifacts could not be resolved: org.apache.parquet:parquet-format-structures:jar:1.12.0-SNAPSHOT, org.apache.parquet:parquet-hadoop:jar:1.12.0-SNAPSHOT: Could not find artifact org.apache.parquet:parquet-format-structures:jar:1.12.0-SNAPSHOT in aliyun-public (https://maven.aliyun.com/repository/public)

2. Install via pip

# pip install parquet

Part 2: Verify that the parquet file is correct

1. Test a local file with the manually built parquet-tools jar

$ alias pq_data
alias pq_data='java -jar /path-to-parquet-mr/parquet-tools/target/parquet-tools-1.10.0.jar head -n 2 '

$ pq_data tank.parquet  // view the column data in the parquet file
creative_id = 1642531926139931
category_name = 国内线路游
ad_keywords = 爸爸;妈妈;老师;班主任;优良;培养;搞笑;经典;游戏;孩子;学习;教育;书籍;青少年;初中;小学;家长;家庭;学校;挂图;
creative_type = 竖版视频
inventory_type = INVENTORY_FEED;INVENTORY_AWEME_FEED;INVENTORY_HOTSOON_FEED;INVENTORY_VIDEO_FEED;
gender = NONE
source = 硕硕图书
first_industry_name = 传媒及内容
second_industry_name = 书籍杂志

creative_id = 1642465004243004
category_name = 住房中介(公司)
ad_keywords = 58同城;赶集;我爱我家;公寓出租;公寓;整租;单间;合租;租房;租房子;上海租房;上海租房网;租房网;上海合租;房屋出租;出租;公租房;租房攻略;廉租房;毕业租房;
creative_type = 竖版视频
inventory_type = INVENTORY_UNION_SLOT;INVENTORY_FEED;INVENTORY_AWEME_FEED;INVENTORY_HOTSOON_FEED;INVENTORY_VIDEO_FEED;
gender = NONE
source = 上海青客租房
first_industry_name = 房地产
second_industry_name = 房地产中介

$ alias pq_schema
alias pq_schema='java -jar /path-to-parquet-mr/parquet-tools/target/parquet-tools-1.10.0.jar schema '

$ pq_schema tank.parquet   // view the parquet schema
message spark_schema {
 optional int64 creative_id;
 optional binary category_name (UTF8);
 optional binary ad_keywords (UTF8);
 optional binary creative_type (UTF8);
 optional binary inventory_type (UTF8);
 optional binary gender (UTF8);
 optional binary source (UTF8);
 optional binary advanced_creative_title (UTF8);
 optional binary first_industry_name (UTF8);
 optional binary second_industry_name (UTF8);
}

2. Test a file on HDFS with the manually built parquet-tools jar

# hadoop jar parquet-tools-1.11.0.jar schema -d hdfs://bigdata1/user/test.parquet |head -n 30
WARNING: Use "yarn jar" to launch YARN applications.
message spark_schema {
 optional binary _c0 (STRING);
 optional binary _c1 (STRING);
 optional binary _c2 (STRING);
 optional binary _c3 (STRING);
 optional binary _c4 (STRING);
 optional binary _c5 (STRING);
 optional binary _c6 (STRING);
 optional binary _c7 (STRING);
 optional binary _c8 (STRING);
 optional binary _c9 (STRING);
}

# hadoop jar parquet-tools-1.11.0.jar head -n 2 hdfs://bigdata1/user/test.parquet
WARNING: Use "yarn jar" to launch YARN applications.
19/12/12 16:58:29 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
19/12/12 16:58:29 INFO hadoop.ParquetFileReader: reading another 1 footers
19/12/12 16:58:29 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
19/12/12 16:58:30 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 18715 records.
19/12/12 16:58:30 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
19/12/12 16:58:30 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
19/12/12 16:58:30 INFO hadoop.InternalParquetRecordReader: block read in memory in 67 ms. row count = 18715
_c0 = 1642643001630776
_c1 = 眼镜
_c2 = 眼镜;变色镜;夜视镜;太阳镜;男人;女人;驾驶;偏光镜;小姐姐;司机;太阳;夏天;日夜两用;开车;老司机;汽车;有车一族;夏季;清晰;眼睛;
_c3 = 竖版视频
_c4 = INVENTORY_AWEME_FEED;
_c5 = NONE
_c6 = 闻书-免费试用
_c8 = 零售
_c9 = 综合类2C电商

_c0 = 1642637175138360
_c1 = 婚恋服务
_c2 = 陌陌;相亲;交友;恋爱;网友;网恋;婚恋;离异;离婚;再婚;结婚;对象;约会;约炮;同城;直播;单身;脱单;少妇;美女;
_c3 = 大图竖图
_c4 = INVENTORY_HOTSOON_FEED;
_c5 = GENDER_MALE
_c7 = 私密相亲,1对1互动
_c8 = 生活服务
_c9 = 婚恋服务

3. Test with the parquet tool installed via pip

$ parquet --help
usage: parquet [-h] [--metadata] [--row-group-metadata] [--no-data]
               [--limit LIMIT] [--col COL] [--no-headers] [--format FORMAT]
               [--debug]
               file

Read parquet files

positional arguments:
  file                  path to the file to parse

optional arguments:
  -h, --help            show this help message and exit
  --metadata            show metadata on file
  --row-group-metadata  show per row group metadata
  --no-data             don't dump any data from the file
  --limit LIMIT         max records to output
  --col COL             only include this column (can be specified multiple
                        times)
  --no-headers          skip headers in output (only applies if format=csv)
  --format FORMAT       format for the output data. can be csv or json.
  --debug               log debug info to stderr

$ parquet tank.parquet --format json > test.json  // convert the data to JSON

$ parquet tank.parquet --limit 2  // show 2 records

$ parquet --metadata tank.parquet |less  // view the table structure
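
The options can also be combined, for instance to pull a single column out as CSV (a sketch based on the help text above; the output file name is arbitrary):

$ parquet tank.parquet --col creative_id --format csv > creative_ids.csv  // dump one column as CSV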

Part 3: Create the table and import the parquet file

1. Create the table

create table `tank` (
`creative_id` BIGINT,
`category_name` string,
`ad_keywords` string,
`creative_type` string,
`inventory_type` string,
`gender` string,
`source` string,
`advanced_creative_title` string,
`first_industry_name` string,
`second_industry_name` string
)
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='SNAPPY');

The table's column types must match the types in the parquet schema, otherwise you will get the following error.

Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.LongWritable

The same goes for the compression codec; the default is gzip.
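
A practical way to avoid both pitfalls is to read the types and the codec off the file before writing the DDL (a sketch; meta is a parquet-tools subcommand that prints the schema plus row-group details, including the compression codec):

// Check the schema and the compression codec before writing the DDL
$ java -jar parquet-tools-1.10.0.jar meta tank.parquet | less
// Mapping used in the DDL above: int64 -> BIGINT, binary (UTF8) -> string.
// If meta reported GZIP instead of SNAPPY, the table property would need to
// be TBLPROPERTIES('parquet.compression'='GZIP').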

2. Load the data

# mkdir /home/hdfs
# cp tank.parquet /home/hdfs
# chown hdfs.hdfs -R /home/hdfs

// Method 1: load through the Hive CLI (load data is a HiveQL statement, not a shell command)
hive> load data local inpath '/home/hdfs/tank.parquet' OVERWRITE into table tanktest.tank;

// Method 2: put the file directly into the table's HDFS directory
# sudo -u hdfs hdfs dfs -put /home/hdfs/tank.parquet /home/cdh6/hive/warehouse/tanktest.db/tank/

Both methods get the file into place. I am running CDH6 with Hadoop 3.0, where the HDFS files are not owned by root, so the ownership of the local file has to be changed first (hence the chown above).
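
Either way, it is worth verifying that the file landed and that Hive can actually decode it (a sketch using the paths from above):

// Confirm the file is in the table's directory
# sudo -u hdfs hdfs dfs -ls /home/cdh6/hive/warehouse/tanktest.db/tank/
// Confirm Hive can read the rows back
hive> select count(*) from tanktest.tank;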

hive> desc formatted tank;
OK
# col_name            	data_type           	comment             

creative_id         	bigint
category_name       	string
ad_keywords         	string
creative_type       	string
inventory_type      	string
gender              	string
source              	string
advanced_creative_title	string
first_industry_name 	string
second_industry_name	string              	                    

# Detailed Table Information
Database:           	tanktest
OwnerType:          	USER
Owner:              	hdfs
CreateTime:         	Fri Dec 13 09:49:47 CST 2019
LastAccessTime:     	UNKNOWN
Retention:          	0
Location:           	hdfs://bigdata1/home/cdh6/hive/warehouse/tanktest.db/tank
Table Type:         	MANAGED_TABLE
Table Parameters:
	numFiles            	1
	numRows             	0
	parquet.compression 	SNAPPY
	rawDataSize         	0
	totalSize           	660925
	transient_lastDdlTime	1576208724          

# Storage Information
SerDe Library:      	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:        	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:       	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:         	No
Num Buckets:        	-1
Bucket Columns:     	[]
Sort Columns:       	[]
Storage Desc Params:
	serialization.format	1
Time taken: 0.171 seconds, Fetched: 40 row(s)

hive> select creative_id,category_name from tank limit 2;
OK
1642531926139931	国内线路游
1642465004243004	住房中介(公司)
Time taken: 0.271 seconds, Fetched: 2 row(s)


Please credit the source when reposting.
Author: 海底苍鹰
URL: http://blog.51yip.com/hadoop/2322.html