hbase可以进行亿级数据存储,查询数据用scan,get还是比较弱的,不支持sql查询,这就增加了数据分析人员的学习成本。hbase只提供了简单的基于Key值的快速查询能力,要进行复杂的查询功能,通过filter还是比较麻烦的。
不过hive与hbase的整合解决这个问题。不仅如此,还能通过hive将数据批量地导入到hbase中。在看本文前,请查看以下文章
centos7 hadoop2.7.7 hbase1.4安装配置详解
一、整合hive和hbase目的
1、通过hive与hbase整合,可以将hbase的数据通过Hive来分析,让hbase支持JOIN、GROUP等SQL查询语法。
2、实现将批量数据导入到hbase表中。
二,整合hive和hbase
1,修改hive-site.xml
# $HIVE_HOME/conf/hive-site.xml //添加以下内容 <property> <name>hbase.zookeeper.quorum</name> <value>bigserver1:2181,bigserver2:2181,testing:2181</value> </property>
2,同步jar包
# cp -r $HIVE_HOME/lib $HIVE_HOME/lib_bak //备份 //删除hive下的zookeeper和hbase的jar包,从hbase中重新加软连 # rm -f $HIVE_HOME/lib/zookeeper-*.jar # rm -f $HIVE_HOME/lib/hbase*.jar # ln -s $HBASE_HOME/lib/zookeeper-*.jar $HIVE_HOME/lib/ # ln -s $HBASE_HOME/lib/hbase*.jar $HIVE_HOME/lib/ //从hive中加载hive-hbase-handler-2.3.4.jar # ln -s $HIVE_HOME/lib/hive-hbase-handler-2.3.4.jar $HBASE_HOME/lib/
集群中的所有hive节点都要进行该步操作,操作完后,重新启动hadoop集群和hbase集群。
删除hive原来的hbase和zookeeper包,从hbase中同步,是为了jar包的统一性
三,测试hive和hbase整合
1,创建hive external表
hive> CREATE EXTERNAL TABLE hive_hbase_user(key bigint,contact map<string,string>,info map<string,string>,login map<string,string>) > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ("hbase.columns.mapping" =":key,contact:,info:,login:") > TBLPROPERTIES ("hbase.table.name" = "test_ns:user"); OK Time taken: 0.16 seconds hive> desc formatted hive_hbase_user; //查看表 OK 2019-10-14 19:59:38,750 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapred.FileInputFormat: Total input paths to process : 1 # col_name data_type comment key bigint contact map<string,string> info map<string,string> login map<string,string> # Detailed Table Information Database: hbase Owner: root CreateTime: Mon Oct 14 19:45:25 CST 2019 LastAccessTime: UNKNOWN Retention: 0 Location: hdfs://bigdata1/user/hive/warehouse/hbase.db/hive_hbase_user Table Type: EXTERNAL_TABLE Table Parameters: COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"} EXTERNAL TRUE hbase.table.name test_ns:user numFiles 0 numRows 0 rawDataSize 0 storage_handler org.apache.hadoop.hive.hbase.HBaseStorageHandler totalSize 0 transient_lastDdlTime 1571053525 # Storage Information SerDe Library: org.apache.hadoop.hive.hbase.HBaseSerDe InputFormat: null OutputFormat: null Compressed: No Num Buckets: -1 Bucket Columns: [] Sort Columns: [] Storage Desc Params: hbase.columns.mapping :key,contact:,info:,login: serialization.format 1 Time taken: 0.069 seconds, Fetched: 37 row(s)
test_ns:user,这张表,已经存在于hbase中,如果在hbase添加表和数据,请参考:hbase 创建表 增删列
2,查询hive_hbase_user表数据
hive> select * from hive_hbase_user; OK 2019-10-14 19:45:31,502 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x12dc702b connecting to ZooKeeper ensemble=bigserver1:2181,bigserver2:2181,testing:2181 2019-10-14 19:45:31,520 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] util.RegionSizeCalculator: Calculating region sizes for table "test_ns:user". 2019-10-14 19:45:31,541 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService 2019-10-14 19:45:31,541 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x102f638e04f0008 2019-10-14 19:45:31,574 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x38b0e2a7 connecting to ZooKeeper ensemble=bigserver1:2181,bigserver2:2181,testing:2181 2019-10-14 19:45:31,587 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.TableInputFormatBase: Input split length: 0 bytes. 2019-10-14 19:45:31,612 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x20150a9ee50001a 1000120190925 {"mobile":"15854654215","tel":"02112345678"} {"age":"35","sex":"Male"} {"password":"Pass1","username":"Tank"} 1000220190926 {"mobile":"15812345678","tel":"02122345678"} {"age":"35"} {"password":"pass1","username":"test1_111"} 1000320190926 {"mobile":"15822345678","tel":"02132345678"} {"age":"36","sex":"male2"} {"password":"pass2","username":"newtest2"} 1000420190926 {"mobile":"15833345678","tel":"02142345678"} {"age":"37","sex":"male3"} {"password":"pass3","username":"test3"} 1000520190926 {"mobile":"15803345678","tel":"02152345678"} {"age":"38","sex":"none"} {"password":"admin","username":"zhang"} Time taken: 0.757 seconds, Fetched: 5 row(s) hive> select key,contact['mobile'],info['sex'],login from hive_hbase_user; OK 2019-10-14 19:56:31,736 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x198524ec connecting to ZooKeeper ensemble=bigserver1:2181,bigserver2:2181,testing:2181 2019-10-14 19:56:31,751 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] util.RegionSizeCalculator: Calculating region sizes for table "test_ns:user". 2019-10-14 19:56:31,783 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService 2019-10-14 19:56:31,783 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x20150a9ee50001c 2019-10-14 19:56:31,815 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x4c5c0306 connecting to ZooKeeper ensemble=bigserver1:2181,bigserver2:2181,testing:2181 2019-10-14 19:56:31,824 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.TableInputFormatBase: Input split length: 0 bytes. 2019-10-14 19:56:31,844 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x20150a9ee50001d 1000120190925 15854654215 Male {"password":"Pass1","username":"Tank"} 1000220190926 15812345678 NULL {"password":"pass1","username":"test1_111"} 1000320190926 15822345678 male2 {"password":"pass2","username":"newtest2"} 1000420190926 15833345678 male3 {"password":"pass3","username":"test3"} 1000520190926 15803345678 none {"password":"admin","username":"zhang"}
3,向hive_hbase_user插入数据
hive> insert into table hive_hbase_user > select '1000520190926' as key, > map('mobile','11111111','tel','2222222') as contact, > map('age','32','sex','male') as info, > map('username','tanktest','password','111111') as login; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20191014201725_bfc8a8eb-7c8b-4927-8650-22b956188de9 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator 2019-10-14 20:17:26,743 WARN [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.TableMapReduceUtil: The addDependencyJars(Configuration, Class<?>...) method has been deprecated since it is easy to use incorrectly. Most users should rely on addDependencyJars(Job) instead. See HBASE-8386 for more details. 2019-10-14 20:17:26,744 WARN [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.TableMapReduceUtil: The addDependencyJars(Configuration, Class<?>...) method has been deprecated since it is easy to use incorrectly. Most users should rely on addDependencyJars(Job) instead. See HBASE-8386 for more details. 2019-10-14 20:17:26,937 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] Configuration.deprecation: mapred.submit.replication is deprecated. Instead, use mapreduce.client.submit.file.replication 2019-10-14 20:17:27,191 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x4c69b396 connecting to ZooKeeper ensemble=bigserver1:2181,bigserver2:2181,testing:2181 2019-10-14 20:17:27,239 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 2019-10-14 20:17:27,265 WARN [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this. 2019-10-14 20:17:37,935 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] input.FileInputFormat: Total input paths to process : 1 2019-10-14 20:17:37,937 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 2, size left: 0 2019-10-14 20:17:38,147 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.JobSubmitter: number of splits:1 2019-10-14 20:17:38,297 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.JobSubmitter: Submitting tokens for job: job_1570864618819_0045 2019-10-14 20:17:38,657 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] impl.YarnClientImpl: Submitted application application_1570864618819_0045 2019-10-14 20:17:38,674 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.Job: The url to track the job: http://testing:8088/proxy/application_1570864618819_0045/ Starting Job = job_1570864618819_0045, Tracking URL = http://testing:8088/proxy/application_1570864618819_0045/ Kill Command = /bigdata/hadoop/bin/hadoop job -kill job_1570864618819_0045 Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0 2019-10-14 20:17:49,986 WARN [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead 2019-10-14 20:17:49,981 Stage-3 map = 0%, reduce = 0% 2019-10-14 20:17:59,757 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 4.85 sec MapReduce Total cumulative CPU time: 4 seconds 850 msec Ended Job = job_1570864618819_0045 MapReduce Jobs Launched: Stage-Stage-3: Map: 1 Cumulative CPU: 4.85 sec HDFS Read: 5166 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 4 seconds 850 msec OK Time taken: 35.122 seconds //通过hive插入hbase真的是慢 hive> insert into table hive_hbase_user > select '1000520191014' as key, > map('mobile','333333','tel','44444') as contact, > map('age','34','sex','male') as info, > map('username','tanktest212','password','111111') as login; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20191014202113_886635e0-7d9e-4d19-86ce-fd20568a0746 Total jobs = 1 Launching Job 1 out of 1 。。。。。。。。省略。。。。。。。。
注意:第一插入插入语句,其实是更新,因为ROWKEY是相同的。第二条是插入
4,创建hive表
hive> create table hive_hbase_test( > key bigint, > mobile string, > sex string > )STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,contact:name,info:sex") > TBLPROPERTIES ("hbase.table.name" = "test_ns:hive_hbase_test"); 2019-10-14 20:55:21,374 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.HBaseAdmin: Created test_ns:hive_hbase_test OK Time taken: 3.563 seconds
test_ns:hive_hbase_test该表,在hbase中,并不存在。hive会创建之。
5,导入数据
hive> insert into table hive_hbase_test > select key,contact['mobile'],info['sex'] from hive_hbase_user; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20191014210101_9b761895-766f-416e-8a32-9bced8fcf939 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator 2019-10-14 21:01:01,659 WARN [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.TableMapReduceUtil: The addDependencyJars(Configuration, Class<?>...) method has been deprecated since it is easy to use incorrectly. Most users should rely on addDependencyJars(Job) instead. See HBASE-8386 for more details. 2019-10-14 21:01:01,659 WARN [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.TableMapReduceUtil: The addDependencyJars(Configuration, Class<?>...) method has been deprecated since it is easy to use incorrectly. Most users should rely on addDependencyJars(Job) instead. See HBASE-8386 for more details. 2019-10-14 21:01:01,682 WARN [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.TableMapReduceUtil: The addDependencyJars(Configuration, Class<?>...) method has been deprecated since it is easy to use incorrectly. Most users should rely on addDependencyJars(Job) instead. See HBASE-8386 for more details. 2019-10-14 21:01:01,682 WARN [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.TableMapReduceUtil: The addDependencyJars(Configuration, Class<?>...) method has been deprecated since it is easy to use incorrectly. Most users should rely on addDependencyJars(Job) instead. See HBASE-8386 for more details. 2019-10-14 21:01:01,899 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x628fa8ea connecting to ZooKeeper ensemble=bigserver1:2181,bigserver2:2181,testing:2181 2019-10-14 21:01:01,931 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 2019-10-14 21:01:01,933 WARN [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this. 2019-10-14 21:01:12,752 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x5f6e6902 connecting to ZooKeeper ensemble=bigserver1:2181,bigserver2:2181,testing:2181 2019-10-14 21:01:12,768 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] util.RegionSizeCalculator: Calculating region sizes for table "test_ns:user". 2019-10-14 21:01:12,816 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService 2019-10-14 21:01:12,816 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x20150a9ee500020 2019-10-14 21:01:13,038 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.JobSubmitter: number of splits:1 2019-10-14 21:01:13,163 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.JobSubmitter: Submitting tokens for job: job_1570864618819_0047 2019-10-14 21:01:13,394 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] impl.YarnClientImpl: Submitted application application_1570864618819_0047 2019-10-14 21:01:13,395 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.Job: The url to track the job: http://testing:8088/proxy/application_1570864618819_0047/ Starting Job = job_1570864618819_0047, Tracking URL = http://testing:8088/proxy/application_1570864618819_0047/ Kill Command = /bigdata/hadoop/bin/hadoop job -kill job_1570864618819_0047 Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0 2019-10-14 21:01:22,735 WARN [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead 2019-10-14 21:01:22,734 Stage-3 map = 0%, reduce = 0% 2019-10-14 21:01:32,251 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 4.53 sec MapReduce Total cumulative CPU time: 4 seconds 530 msec Ended Job = job_1570864618819_0047 MapReduce Jobs Launched: Stage-Stage-3: Map: 1 Cumulative CPU: 4.53 sec HDFS Read: 5165 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 4 seconds 530 msec OK Time taken: 33.251 seconds hive> select * from hive_hbase_test; OK 2019-10-14 21:01:53,965 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x2e3fdf30 connecting to ZooKeeper ensemble=bigserver1:2181,bigserver2:2181,testing:2181 2019-10-14 21:01:53,983 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] util.RegionSizeCalculator: Calculating region sizes for table "test_ns:hive_hbase_test". 2019-10-14 21:01:54,033 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService 2019-10-14 21:01:54,033 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x102f638e04f0010 2019-10-14 21:01:54,070 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0xe8bd42f connecting to ZooKeeper ensemble=bigserver1:2181,bigserver2:2181,testing:2181 2019-10-14 21:01:54,082 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] mapreduce.TableInputFormatBase: Input split length: 0 bytes. 2019-10-14 21:01:54,186 INFO [53d52e8a-ead2-47e1-a7a8-d32769c6008d main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x20150a9ee500021 1000120190925 15854654215 Male 1000220190926 15812345678 NULL 1000320190926 15822345678 male2 1000420190926 15833345678 male3 1000520190926 11111111 male 1000520191014 333333 male Time taken: 0.285 seconds, Fetched: 6 row(s)
从hive_hbase_user表中,往hive_hbase_test表中导入部分数据
6,通过hbase shell查看表
hbase(main):011:0> list TABLE test_ns:hive_hbase_test test_ns:t1 test_ns:user 3 row(s) in 0.0360 seconds => ["test_ns:hive_hbase_test", "test_ns:t1", "test_ns:user"] hbase(main):012:0> scan "test_ns:hive_hbase_test" ROW COLUMN+CELL 1000120190925 column=contact:name, timestamp=1571058091498, value=15854654215 1000120190925 column=info:sex, timestamp=1571058091498, value=Male 1000220190926 column=contact:name, timestamp=1571058091498, value=15812345678 1000320190926 column=contact:name, timestamp=1571058091498, value=15822345678 1000320190926 column=info:sex, timestamp=1571058091498, value=male2 1000420190926 column=contact:name, timestamp=1571058091498, value=15833345678 1000420190926 column=info:sex, timestamp=1571058091498, value=male3 1000520190926 column=contact:name, timestamp=1571058091498, value=11111111 1000520190926 column=info:sex, timestamp=1571058091498, value=male 1000520191014 column=contact:name, timestamp=1571058091498, value=333333 1000520191014 column=info:sex, timestamp=1571058091498, value=male 6 row(s) in 0.0960 seconds
通过hive创建的表以及数据,通过hbase shell是可以访问的
7,查看hive的表目录
[root@bigserver1 bin]# hdfs dfs -ls /user/hive/warehouse/hbase.db Found 4 items drwxr-xr-x - root supergroup 0 2019-10-14 21:01 /user/hive/warehouse/hbase.db/_tmp.hive_hbase_test //多出一个文件夹 drwxr-xr-x - root supergroup 0 2019-10-14 20:17 /user/hive/warehouse/hbase.db/_tmp.hive_hbase_user //多出一个文件夹 drwxrwxrwx - root supergroup 0 2019-10-14 21:01 /user/hive/warehouse/hbase.db/hive_hbase_test drwxrwxrwx - root supergroup 0 2019-10-14 20:21 /user/hive/warehouse/hbase.db/hive_hbase_user [root@bigserver2 bin]# hdfs dfs -ls /user/hive/warehouse/hbase.db/hive_hbase_test //空目录 [root@bigserver2 bin]# hdfs dfs -ls /user/hive/warehouse/hbase.db/hive_hbase_user //空目录
hive和hbase整合后,通过hive创建的表,会多出来一个临时目录。不管是external表,还是非external表,数据都不存在hive的表对应的目录中。
转载请注明
作者:海底苍鹰
地址:http://blog.51yip.com/hadoop/2196.html