网上购物,很多人只会看最近三个月,或者半年的订单,更早的历史订单就很少有人关注了。这种场景还是挺适合kudu+hdfs.
Apache Kudu旨在快速分析、快速变化的数据。Kudu提供快速插入/更新和高效列扫描的组合,以在单个存储层上实现多个实时分析工作负载。因此,Kudu非常适合作为存储需要实时查询的数据的仓库。此外,Kudu支持实时更新和删除行,以支持延迟到达的数据和数据更正。
Apache HDFS旨在以低成本实现无限的可扩展性。它针对数据不可变的面向批处理的场景进行了优化,与Apache Parquet文件格式配合使用时,可以以极高的吞吐量和效率访问结构化数据。
kudu存最近半年的订单数据,hdfs存历史的订单数据。
1,创建kudu测试表
[bigserver2:21000] order_test> CREATE TABLE tank_order_kudu > ( > id BIGINT, > month STRING, > name STRING, > order_time TIMESTAMP, > PRIMARY KEY(id,month) > ) > PARTITION BY HASH(id) PARTITIONS 5, > RANGE(month) ( > PARTITION VALUE="2019-07", > PARTITION VALUE="2019-08", > PARTITION VALUE="2019-09", > PARTITION VALUE="2019-10", > PARTITION VALUE="2019-11", > PARTITION VALUE="2019-12", > PARTITION VALUE="2020-01" > ) STORED AS KUDU; Query: CREATE TABLE tank_order_kudu ( id BIGINT, month STRING, name STRING, order_time TIMESTAMP, PRIMARY KEY(id,month) ) PARTITION BY HASH(id) PARTITIONS 5, RANGE(month) ( PARTITION VALUE="2019-07", PARTITION VALUE="2019-08", PARTITION VALUE="2019-09", PARTITION VALUE="2019-10", PARTITION VALUE="2019-11", PARTITION VALUE="2019-12", PARTITION VALUE="2020-01" ) STORED AS KUDU +-------------------------+ | summary | +-------------------------+ | Table has been created. | +-------------------------+ Fetched 1 row(s) in 1.18s
2,创建hdfs测试表
[bigserver2:21000] order_test> CREATE TABLE tank_order_hdfs > ( > id BIGINT, > name STRING, > order_time TIMESTAMP > ) > PARTITIONED BY (month STRING) > STORED AS PARQUET; Query: CREATE TABLE tank_order_hdfs ( id BIGINT, name STRING, order_time TIMESTAMP ) PARTITIONED BY (month STRING) STORED AS PARQUET +-------------------------+ | summary | +-------------------------+ | Table has been created. | +-------------------------+ Fetched 1 row(s) in 0.14s
3,kudu表插入数据
[bigserver2:21000] order_test> INSERT INTO tank_order_kudu VALUES > (100,"2019-07",'tank','2019-07-01'), > (101,"2019-07",'test','2019-07-21'), > (102,"2019-08",'zhang','2019-08-12'), > (103,"2019-09",'tankzhang','2019-09-10'), > (104,"2019-10",'hao', '2019-10-15'), > (105,"2019-11",'tank1', '2019-11-09'), > (106,"2019-12",'tank2', '2019-12-23'), > (107,"2020-01",'tank3', '2020-01-11'); Query: INSERT INTO tank_order_kudu VALUES (100,"2019-07",'tank','2019-07-01'), (101,"2019-07",'test','2019-07-21'), (102,"2019-08",'zhang','2019-08-12'), (103,"2019-09",'tankzhang','2019-09-10'), (104,"2019-10",'hao', '2019-10-15'), (105,"2019-11",'tank1', '2019-11-09'), (106,"2019-12",'tank2', '2019-12-23'), (107,"2020-01",'tank3', '2020-01-11') Query submitted at: 2020-01-14 17:42:28 (Coordinator: http://bigserver2:25000) Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=db4c3023789860ec:195019ee00000000 Modified 8 row(s), 0 row error(s) in 3.90s
4,hdfs表插入数据
[bigserver2:21000] order_test> INSERT INTO tank_order_hdfs (id,name,order_time,month) VALUES > (11,'test','2019-01-21',"2019-01"), > (12,'zhang','2019-02-12',"2019-02"), > (13,'tankzhang','2019-03-10',"2019-03"), > (14,'hao', '2019-04-15',"2019-04"), > (15,'tank1', '2019-05-09',"2019-05"), > (16,'tank2', '2019-06-23',"2019-06"); Query: INSERT INTO tank_order_hdfs (id,name,order_time,month) VALUES (11,'test','2019-01-21',"2019-01"), (12,'zhang','2019-02-12',"2019-02"), (13,'tankzhang','2019-03-10',"2019-03"), (14,'hao', '2019-04-15',"2019-04"), (15,'tank1', '2019-05-09',"2019-05"), (16,'tank2', '2019-06-23',"2019-06") Query submitted at: 2020-01-14 17:43:20 (Coordinator: http://bigserver2:25000) Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=85464b9cc23eb494:23e5752e00000000 Modified 6 row(s) in 3.54s
kudu表的数据,根hdfs表的数据,没有月份上重复。
5,创建视图
[bigserver2:21000] order_test> CREATE VIEW tank_order_view AS > SELECT id,month,name,order_time > FROM tank_order_kudu > WHERE month >= "2019-07" > UNION ALL > SELECT id,month,name,order_time > FROM tank_order_hdfs > WHERE month < "2019-07"; Query: CREATE VIEW tank_order_view AS SELECT id,month,name,order_time FROM tank_order_kudu WHERE month >= "2019-07" UNION ALL SELECT id,month,name,order_time FROM tank_order_hdfs WHERE month < "2019-07" +------------------------+ | summary | +------------------------+ | View has been created. | +------------------------+ Fetched 1 row(s) in 0.05s
6,查看view和表数据
[bigserver2:21000] order_test> select * from tank_order_view order by id; Query: select * from tank_order_view order by id Query submitted at: 2020-01-14 17:44:39 (Coordinator: http://bigserver2:25000) Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=7d4a82f5ff1a610c:2c92cb8d00000000 +-----+---------+-----------+---------------------+ | id | month | name | order_time | +-----+---------+-----------+---------------------+ | 11 | 2019-01 | test | 2019-01-21 00:00:00 | | 12 | 2019-02 | zhang | 2019-02-12 00:00:00 | | 13 | 2019-03 | tankzhang | 2019-03-10 00:00:00 | | 14 | 2019-04 | hao | 2019-04-15 00:00:00 | | 15 | 2019-05 | tank1 | 2019-05-09 00:00:00 | | 16 | 2019-06 | tank2 | 2019-06-23 00:00:00 | | 100 | 2019-07 | tank | 2019-07-01 00:00:00 | | 101 | 2019-07 | test | 2019-07-21 00:00:00 | | 102 | 2019-08 | zhang | 2019-08-12 00:00:00 | | 103 | 2019-09 | tankzhang | 2019-09-10 00:00:00 | | 104 | 2019-10 | hao | 2019-10-15 00:00:00 | | 105 | 2019-11 | tank1 | 2019-11-09 00:00:00 | | 106 | 2019-12 | tank2 | 2019-12-23 00:00:00 | | 107 | 2020-01 | tank3 | 2020-01-11 00:00:00 | +-----+---------+-----------+---------------------+ Fetched 14 row(s) in 3.73s [bigserver2:21000] order_test> select * from tank_order_kudu; Query: select * from tank_order_kudu Query submitted at: 2020-01-14 17:45:04 (Coordinator: http://bigserver2:25000) Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=0642df8221628792:066e868100000000 +-----+---------+-----------+---------------------+ | id | month | name | order_time | +-----+---------+-----------+---------------------+ | 100 | 2019-07 | tank | 2019-07-01 00:00:00 | | 102 | 2019-08 | zhang | 2019-08-12 00:00:00 | | 101 | 2019-07 | test | 2019-07-21 00:00:00 | | 104 | 2019-10 | hao | 2019-10-15 00:00:00 | | 105 | 2019-11 | tank1 | 2019-11-09 00:00:00 | | 106 | 2019-12 | tank2 | 2019-12-23 00:00:00 | | 107 | 2020-01 | tank3 | 2020-01-11 00:00:00 | | 103 | 2019-09 | tankzhang | 2019-09-10 00:00:00 | +-----+---------+-----------+---------------------+ Fetched 8 row(s) in 0.14s [bigserver2:21000] order_test> select * from tank_order_hdfs; Query: select * from tank_order_hdfs Query submitted at: 2020-01-14 17:45:09 (Coordinator: http://bigserver2:25000) Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=45489970a4fece53:13c264f200000000 +----+-----------+---------------------+---------+ | id | name | order_time | month | +----+-----------+---------------------+---------+ | 12 | zhang | 2019-02-12 00:00:00 | 2019-02 | | 16 | tank2 | 2019-06-23 00:00:00 | 2019-06 | | 15 | tank1 | 2019-05-09 00:00:00 | 2019-05 | | 13 | tankzhang | 2019-03-10 00:00:00 | 2019-03 | | 11 | test | 2019-01-21 00:00:00 | 2019-01 | | 14 | hao | 2019-04-15 00:00:00 | 2019-04 | +----+-----------+---------------------+---------+ Fetched 6 row(s) in 0.12s
二张表的数据,都能在视图中体现
7,查看表分区
[bigserver2:21000] order_test> show range partitions tank_order_kudu; Query: show range partitions tank_order_kudu +-------------------+ | RANGE (month) | +-------------------+ | VALUE = "2019-07" | | VALUE = "2019-08" | | VALUE = "2019-09" | | VALUE = "2019-10" | | VALUE = "2019-11" | | VALUE = "2019-12" | | VALUE = "2020-01" | +-------------------+ Fetched 7 row(s) in 0.03s [bigserver2:21000] order_test> show partitions tank_order_hdfs; Query: show partitions tank_order_hdfs +---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+ | month | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+ | 2019-01 | -1 | 1 | 774B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-01 | | 2019-02 | -1 | 1 | 779B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-02 | | 2019-03 | -1 | 1 | 799B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-03 | | 2019-04 | -1 | 1 | 769B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-04 | | 2019-05 | -1 | 1 | 779B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-05 | | 2019-06 | -1 | 1 | 779B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-06 | | Total | -1 | 6 | 4.57KB | 0B | | | | | +---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+ Fetched 7 row(s) in 0.02s
8,准备impala sql脚本
# cat tank_order.sql use order_test; //选择数据库 //将kudu表某月数据插入到hdfs表,这个时候,二表有重复数据 INSERT INTO ${var:hdfs_table} PARTITION (month) SELECT id,name,order_time,month FROM ${var:kudu_table} WHERE month = from_unixtime(unix_timestamp(add_months(now(), -6)),'yyyy-MM'); //优化表,impala查询更快 COMPUTE INCREMENTAL STATS ${var:hdfs_table}; //修改视图,这样解决数据重复问题 ALTER VIEW ${var:view_name} AS SELECT id,month,name,order_time FROM ${var:kudu_table} WHERE month >= from_unixtime(unix_timestamp(add_months(now(), -5)),'yyyy-MM') UNION ALL SELECT id,month,name,order_time FROM ${var:hdfs_table} WHERE month < from_unixtime(unix_timestamp(add_months(now(), -5)),'yyyy-MM'); //删除kudu表旧分区,等于删除了kudu表中,在hdfs表中的数据 ALTER TABLE ${var:kudu_table} DROP RANGE PARTITION VALUE=from_unixtime(unix_timestamp(add_months(now(), -6)),'yyyy-MM'); //增加kudu表新分区 ALTER TABLE ${var:kudu_table} ADD RANGE PARTITION VALUE=from_unixtime(unix_timestamp(add_months(now(), +1)),'yyyy-MM');
9,执行脚本
[root@bigserver2 tanktest]# impala-shell -i bigserver2:21000 -f tank_order.sql --var=view_name=tank_order_view --var=kudu_table=tank_order_kudu --var=hdfs_table=tank_order_hdfs Starting Impala Shell without Kerberos authentication Opened TCP connection to bigserver2:21000 Connected to bigserver2:21000 Server version: impalad version 3.2.0-cdh6.3.1 RELEASE (build 3d5de689e9b9cfde0c01e1c64f4b4e830cee69c3) Query: use order_test Query: INSERT INTO tank_order_hdfs PARTITION (month) SELECT id,name,order_time,month FROM tank_order_kudu WHERE month = from_unixtime(unix_timestamp(add_months(now(), -6)),'yyyy-MM') Query submitted at: 2020-01-14 17:49:22 (Coordinator: http://bigserver2:25000) Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=44439e6c6e711ac4:0105b2f900000000 Modified 2 row(s) in 0.67s Query: COMPUTE INCREMENTAL STATS tank_order_hdfs +-----------------------------------------+ | summary | +-----------------------------------------+ | Updated 7 partition(s) and 3 column(s). | +-----------------------------------------+ Fetched 1 row(s) in 1.32s Query: ALTER VIEW tank_order_view AS SELECT id,month,name,order_time FROM tank_order_kudu WHERE month >= from_unixtime(unix_timestamp(add_months(now(), -5)),'yyyy-MM') UNION ALL SELECT id,month,name,order_time FROM tank_order_hdfs WHERE month < from_unixtime(unix_timestamp(add_months(now(), -5)),'yyyy-MM') +------------------------+ | summary | +------------------------+ | View has been altered. | +------------------------+ Fetched 1 row(s) in 0.07s Query: ALTER TABLE tank_order_kudu DROP RANGE PARTITION VALUE=from_unixtime(unix_timestamp(add_months(now(), -6)),'yyyy-MM') +-----------------------------------+ | summary | +-----------------------------------+ | Range partition has been dropped. | +-----------------------------------+ Fetched 1 row(s) in 0.13s Query: ALTER TABLE tank_order_kudu ADD RANGE PARTITION VALUE=from_unixtime(unix_timestamp(add_months(now(), +1)),'yyyy-MM') +---------------------------------+ | summary | +---------------------------------+ | Range partition has been added. | +---------------------------------+ Fetched 1 row(s) in 0.17s
10,执行后,查看数据
[bigserver2:21000] order_test> select * from tank_order_view order by id; Query: select * from tank_order_view order by id Query submitted at: 2020-01-14 17:50:50 (Coordinator: http://bigserver2:25000) Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=0042af35d2363548:5e6046a400000000 +-----+---------+-----------+---------------------+ | id | month | name | order_time | +-----+---------+-----------+---------------------+ | 11 | 2019-01 | test | 2019-01-21 00:00:00 | | 12 | 2019-02 | zhang | 2019-02-12 00:00:00 | | 13 | 2019-03 | tankzhang | 2019-03-10 00:00:00 | | 14 | 2019-04 | hao | 2019-04-15 00:00:00 | | 15 | 2019-05 | tank1 | 2019-05-09 00:00:00 | | 16 | 2019-06 | tank2 | 2019-06-23 00:00:00 | | 100 | 2019-07 | tank | 2019-07-01 00:00:00 | | 101 | 2019-07 | test | 2019-07-21 00:00:00 | | 102 | 2019-08 | zhang | 2019-08-12 00:00:00 | | 103 | 2019-09 | tankzhang | 2019-09-10 00:00:00 | | 104 | 2019-10 | hao | 2019-10-15 00:00:00 | | 105 | 2019-11 | tank1 | 2019-11-09 00:00:00 | | 106 | 2019-12 | tank2 | 2019-12-23 00:00:00 | | 107 | 2020-01 | tank3 | 2020-01-11 00:00:00 | +-----+---------+-----------+---------------------+ Fetched 14 row(s) in 0.29s [bigserver2:21000] order_test> select * from tank_order_kudu; Query: select * from tank_order_kudu Query submitted at: 2020-01-14 17:51:06 (Coordinator: http://bigserver2:25000) Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=a14e07360ee03232:d8be8fe000000000 +-----+---------+-----------+---------------------+ | id | month | name | order_time | +-----+---------+-----------+---------------------+ | 107 | 2020-01 | tank3 | 2020-01-11 00:00:00 | | 104 | 2019-10 | hao | 2019-10-15 00:00:00 | | 103 | 2019-09 | tankzhang | 2019-09-10 00:00:00 | | 102 | 2019-08 | zhang | 2019-08-12 00:00:00 | | 105 | 2019-11 | tank1 | 2019-11-09 00:00:00 | | 106 | 2019-12 | tank2 | 2019-12-23 00:00:00 | +-----+---------+-----------+---------------------+ Fetched 6 row(s) in 0.14s [bigserver2:21000] order_test> select * from tank_order_hdfs; Query: select * from tank_order_hdfs Query submitted at: 2020-01-14 17:51:53 (Coordinator: http://bigserver2:25000) Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=204219b3ecc74213:8bbc30bc00000000 +-----+-----------+---------------------+---------+ | id | name | order_time | month | +-----+-----------+---------------------+---------+ | 12 | zhang | 2019-02-12 00:00:00 | 2019-02 | | 13 | tankzhang | 2019-03-10 00:00:00 | 2019-03 | | 14 | hao | 2019-04-15 00:00:00 | 2019-04 | | 15 | tank1 | 2019-05-09 00:00:00 | 2019-05 | | 16 | tank2 | 2019-06-23 00:00:00 | 2019-06 | | 101 | test | 2019-07-21 00:00:00 | 2019-07 | | 100 | tank | 2019-07-01 00:00:00 | 2019-07 | | 11 | test | 2019-01-21 00:00:00 | 2019-01 | +-----+-----------+---------------------+---------+ Fetched 8 row(s) in 0.14s
从执行后的数据可以看出,视图数据是没有变的。kudu表的部分数据迁到了hdfs表中了。
11,查看分区
[bigserver2:21000] order_test> show range partitions tank_order_kudu; Query: show range partitions tank_order_kudu +-------------------+ | RANGE (month) | +-------------------+ | VALUE = "2019-08" | | VALUE = "2019-09" | | VALUE = "2019-10" | | VALUE = "2019-11" | | VALUE = "2019-12" | | VALUE = "2020-01" | | VALUE = "2020-02" | +-------------------+ Fetched 7 row(s) in 0.05s [bigserver2:21000] order_test> show partitions tank_order_hdfs; Query: show partitions tank_order_hdfs +---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+ | month | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+ | 2019-01 | 1 | 1 | 774B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-01 | | 2019-02 | 1 | 1 | 779B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-02 | | 2019-03 | 1 | 1 | 799B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-03 | | 2019-04 | 1 | 1 | 769B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-04 | | 2019-05 | 1 | 1 | 779B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-05 | | 2019-06 | 1 | 1 | 779B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-06 | | 2019-07 | 2 | 1 | 798B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-07 | | Total | 8 | 7 | 5.35KB | 0B | | | | | +---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+
kudu表,少了一个旧分区,多了一个新分区。
hdfs表,多了一个分区。
转载请注明
作者:海底苍鹰
地址:http://blog.51yip.com/hadoop/2348.html