Kudu + HDFS data tiering for flexible storage

Posted by 张映 on 2020-01-14

Category: hadoop/spark/scala


When shopping online, most people only look back at the last three months of orders, or at most half a year; older historical orders are rarely touched. This kind of scenario is a good fit for Kudu + HDFS.

Apache Kudu is designed for fast analytics on rapidly changing data. It combines fast inserts/updates with efficient columnar scans, so a single storage layer can serve multiple real-time analytic workloads. That makes Kudu a good fit as a warehouse for data that has to be queried in real time. Kudu also supports updating and deleting rows in real time, which accommodates late-arriving data and data corrections.

Apache HDFS is designed for unlimited scalability at low cost. It is optimized for batch-oriented scenarios where data is immutable, and combined with the Apache Parquet file format it delivers very high throughput and efficient access to structured data.

In this setup, Kudu stores the most recent six months of order data, and HDFS stores the older, historical orders.

1. Create the Kudu test table

The table is hash-partitioned on id into 5 buckets and range-partitioned by month. In Kudu, partition columns must be part of the primary key, which is why month is included in PRIMARY KEY(id, month).

[bigserver2:21000] order_test> CREATE TABLE tank_order_kudu
                             > (
                             >   id BIGINT,
                             >   month STRING,
                             >   name STRING,
                             >   order_time TIMESTAMP,
                             >   PRIMARY KEY(id,month)
                             > )
                             > PARTITION BY HASH(id) PARTITIONS 5,
                             > RANGE(month) (
                             >       PARTITION VALUE="2019-07",
                             >       PARTITION VALUE="2019-08",
                             >       PARTITION VALUE="2019-09",
                             >       PARTITION VALUE="2019-10",
                             >       PARTITION VALUE="2019-11",
                             >       PARTITION VALUE="2019-12",
                             >       PARTITION VALUE="2020-01"
                             > ) STORED AS KUDU;
Query: CREATE TABLE tank_order_kudu
(
  id BIGINT,
  month STRING,
  name STRING,
  order_time TIMESTAMP,
  PRIMARY KEY(id,month)
)
PARTITION BY HASH(id) PARTITIONS 5,
RANGE(month) (
      PARTITION VALUE="2019-07",
      PARTITION VALUE="2019-08",
      PARTITION VALUE="2019-09",
      PARTITION VALUE="2019-10",
      PARTITION VALUE="2019-11",
      PARTITION VALUE="2019-12",
      PARTITION VALUE="2020-01"
) STORED AS KUDU
+-------------------------+
| summary                 |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 1.18s

2. Create the HDFS test table

[bigserver2:21000] order_test> CREATE TABLE tank_order_hdfs
                             > (
                             >   id BIGINT,
                             >   name STRING,
                             >   order_time TIMESTAMP
                             > )
                             > PARTITIONED BY (month STRING)
                             > STORED AS PARQUET;
Query: CREATE TABLE tank_order_hdfs
(
  id BIGINT,
  name STRING,
  order_time TIMESTAMP
)
PARTITIONED BY (month STRING)
STORED AS PARQUET
+-------------------------+
| summary                 |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 0.14s

3. Insert data into the Kudu table

[bigserver2:21000] order_test> INSERT INTO tank_order_kudu VALUES
                             > (100,"2019-07",'tank','2019-07-01'),
                             > (101,"2019-07",'test','2019-07-21'),
                             > (102,"2019-08",'zhang','2019-08-12'),
                             > (103,"2019-09",'tankzhang','2019-09-10'),
                             > (104,"2019-10",'hao', '2019-10-15'),
                             > (105,"2019-11",'tank1', '2019-11-09'),
                             > (106,"2019-12",'tank2', '2019-12-23'),
                             > (107,"2020-01",'tank3', '2020-01-11');
Query: INSERT INTO tank_order_kudu VALUES
(100,"2019-07",'tank','2019-07-01'),
(101,"2019-07",'test','2019-07-21'),
(102,"2019-08",'zhang','2019-08-12'),
(103,"2019-09",'tankzhang','2019-09-10'),
(104,"2019-10",'hao', '2019-10-15'),
(105,"2019-11",'tank1', '2019-11-09'),
(106,"2019-12",'tank2', '2019-12-23'),
(107,"2020-01",'tank3', '2020-01-11')
Query submitted at: 2020-01-14 17:42:28 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=db4c3023789860ec:195019ee00000000
Modified 8 row(s), 0 row error(s) in 3.90s

4. Insert data into the HDFS table

[bigserver2:21000] order_test> INSERT INTO tank_order_hdfs (id,name,order_time,month) VALUES
                             > (11,'test','2019-01-21',"2019-01"),
                             > (12,'zhang','2019-02-12',"2019-02"),
                             > (13,'tankzhang','2019-03-10',"2019-03"),
                             > (14,'hao', '2019-04-15',"2019-04"),
                             > (15,'tank1', '2019-05-09',"2019-05"),
                             > (16,'tank2', '2019-06-23',"2019-06");
Query: INSERT INTO tank_order_hdfs (id,name,order_time,month) VALUES
(11,'test','2019-01-21',"2019-01"),
(12,'zhang','2019-02-12',"2019-02"),
(13,'tankzhang','2019-03-10',"2019-03"),
(14,'hao', '2019-04-15',"2019-04"),
(15,'tank1', '2019-05-09',"2019-05"),
(16,'tank2', '2019-06-23',"2019-06")
Query submitted at: 2020-01-14 17:43:20 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=85464b9cc23eb494:23e5752e00000000
Modified 6 row(s) in 3.54s

The Kudu table and the HDFS table do not overlap on any month.
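To double-check that the two tables really do not share any month, you can run a quick sanity query such as the sketch below; it should return zero rows:

SELECT k.month
FROM (SELECT DISTINCT month FROM tank_order_kudu) k
JOIN (SELECT DISTINCT month FROM tank_order_hdfs) h
ON k.month = h.month;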

5. Create the view

[bigserver2:21000] order_test> CREATE VIEW tank_order_view AS
                             > SELECT id,month,name,order_time
                             > FROM tank_order_kudu
                             > WHERE month >= "2019-07"
                             > UNION ALL
                             > SELECT id,month,name,order_time
                             > FROM tank_order_hdfs
                             > WHERE month < "2019-07";
Query: CREATE VIEW tank_order_view AS
SELECT id,month,name,order_time
FROM tank_order_kudu WHERE month >= "2019-07"
UNION ALL
SELECT id,month,name,order_time
FROM tank_order_hdfs
WHERE month < "2019-07"
+------------------------+
| summary                |
+------------------------+
| View has been created. |
+------------------------+
Fetched 1 row(s) in 0.05s

6. Check the view and table data

[bigserver2:21000] order_test> select * from tank_order_view order by id;
Query: select * from tank_order_view order by id
Query submitted at: 2020-01-14 17:44:39 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=7d4a82f5ff1a610c:2c92cb8d00000000
+-----+---------+-----------+---------------------+
| id  | month   | name      | order_time          |
+-----+---------+-----------+---------------------+
| 11  | 2019-01 | test      | 2019-01-21 00:00:00 |
| 12  | 2019-02 | zhang     | 2019-02-12 00:00:00 |
| 13  | 2019-03 | tankzhang | 2019-03-10 00:00:00 |
| 14  | 2019-04 | hao       | 2019-04-15 00:00:00 |
| 15  | 2019-05 | tank1     | 2019-05-09 00:00:00 |
| 16  | 2019-06 | tank2     | 2019-06-23 00:00:00 |
| 100 | 2019-07 | tank      | 2019-07-01 00:00:00 |
| 101 | 2019-07 | test      | 2019-07-21 00:00:00 |
| 102 | 2019-08 | zhang     | 2019-08-12 00:00:00 |
| 103 | 2019-09 | tankzhang | 2019-09-10 00:00:00 |
| 104 | 2019-10 | hao       | 2019-10-15 00:00:00 |
| 105 | 2019-11 | tank1     | 2019-11-09 00:00:00 |
| 106 | 2019-12 | tank2     | 2019-12-23 00:00:00 |
| 107 | 2020-01 | tank3     | 2020-01-11 00:00:00 |
+-----+---------+-----------+---------------------+
Fetched 14 row(s) in 3.73s

[bigserver2:21000] order_test> select * from tank_order_kudu;
Query: select * from tank_order_kudu
Query submitted at: 2020-01-14 17:45:04 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=0642df8221628792:066e868100000000
+-----+---------+-----------+---------------------+
| id  | month   | name      | order_time          |
+-----+---------+-----------+---------------------+
| 100 | 2019-07 | tank      | 2019-07-01 00:00:00 |
| 102 | 2019-08 | zhang     | 2019-08-12 00:00:00 |
| 101 | 2019-07 | test      | 2019-07-21 00:00:00 |
| 104 | 2019-10 | hao       | 2019-10-15 00:00:00 |
| 105 | 2019-11 | tank1     | 2019-11-09 00:00:00 |
| 106 | 2019-12 | tank2     | 2019-12-23 00:00:00 |
| 107 | 2020-01 | tank3     | 2020-01-11 00:00:00 |
| 103 | 2019-09 | tankzhang | 2019-09-10 00:00:00 |
+-----+---------+-----------+---------------------+
Fetched 8 row(s) in 0.14s

[bigserver2:21000] order_test> select * from tank_order_hdfs;
Query: select * from tank_order_hdfs
Query submitted at: 2020-01-14 17:45:09 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=45489970a4fece53:13c264f200000000
+----+-----------+---------------------+---------+
| id | name      | order_time          | month   |
+----+-----------+---------------------+---------+
| 12 | zhang     | 2019-02-12 00:00:00 | 2019-02 |
| 16 | tank2     | 2019-06-23 00:00:00 | 2019-06 |
| 15 | tank1     | 2019-05-09 00:00:00 | 2019-05 |
| 13 | tankzhang | 2019-03-10 00:00:00 | 2019-03 |
| 11 | test      | 2019-01-21 00:00:00 | 2019-01 |
| 14 | hao       | 2019-04-15 00:00:00 | 2019-04 |
+----+-----------+---------------------+---------+
Fetched 6 row(s) in 0.12s

The data from both tables is visible through the view.
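If you are curious how Impala turns a query against the view into scans of the two storage layers, EXPLAIN shows how each branch of the UNION ALL (the Kudu scan and the Parquet scan) is planned; the exact output depends on the Impala version and on whether statistics have been computed:

EXPLAIN SELECT * FROM tank_order_view WHERE month = "2019-12";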

7. Check the table partitions

[bigserver2:21000] order_test> show range partitions tank_order_kudu;
Query: show range partitions tank_order_kudu
+-------------------+
| RANGE (month)     |
+-------------------+
| VALUE = "2019-07" |
| VALUE = "2019-08" |
| VALUE = "2019-09" |
| VALUE = "2019-10" |
| VALUE = "2019-11" |
| VALUE = "2019-12" |
| VALUE = "2020-01" |
+-------------------+
Fetched 7 row(s) in 0.03s

[bigserver2:21000] order_test> show partitions tank_order_hdfs;
Query: show partitions tank_order_hdfs
+---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+
| month   | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                             |
+---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+
| 2019-01 | -1    | 1      | 774B   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-01 |
| 2019-02 | -1    | 1      | 779B   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-02 |
| 2019-03 | -1    | 1      | 799B   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-03 |
| 2019-04 | -1    | 1      | 769B   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-04 |
| 2019-05 | -1    | 1      | 779B   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-05 |
| 2019-06 | -1    | 1      | 779B   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-06 |
| Total   | -1    | 6      | 4.57KB | 0B           |                   |         |                   |                                                                                      |
+---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+
Fetched 7 row(s) in 0.02s
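In the HDFS table's partition listing, #Rows is -1 and Incremental stats is false because no statistics have been computed yet. The script in step 8 takes care of this with COMPUTE INCREMENTAL STATS; if you want to populate the counts right away, the statement is simply:

COMPUTE INCREMENTAL STATS tank_order_hdfs;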

8. Prepare the Impala SQL script

# cat tank_order.sql
use order_test;   -- select the database

-- Copy one month of data from the Kudu table into the HDFS table; at this point the two tables temporarily hold duplicate rows
INSERT INTO ${var:hdfs_table} PARTITION (month)
SELECT id,name,order_time,month
FROM ${var:kudu_table}
WHERE month = from_unixtime(unix_timestamp(add_months(now(), -6)),'yyyy-MM');

-- Recompute table statistics so Impala queries run faster
COMPUTE INCREMENTAL STATS ${var:hdfs_table};

-- Move the view's month boundary forward, which resolves the duplication
ALTER VIEW ${var:view_name} AS
SELECT id,month,name,order_time
FROM ${var:kudu_table}
WHERE month >= from_unixtime(unix_timestamp(add_months(now(), -5)),'yyyy-MM')
UNION ALL
SELECT id,month,name,order_time
FROM ${var:hdfs_table}
WHERE month < from_unixtime(unix_timestamp(add_months(now(), -5)),'yyyy-MM');

-- Drop the old Kudu range partition, i.e. remove from Kudu the data that now also lives in the HDFS table
ALTER TABLE ${var:kudu_table}
DROP RANGE PARTITION VALUE=from_unixtime(unix_timestamp(add_months(now(), -6)),'yyyy-MM');

-- Add a new range partition to the Kudu table for the upcoming month
ALTER TABLE ${var:kudu_table}
ADD RANGE PARTITION VALUE=from_unixtime(unix_timestamp(add_months(now(), +1)),'yyyy-MM');
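The month boundaries in the script are derived from the current date. Since this run happened on 2020-01-14, add_months(now(), -6) falls in 2019-07, the month being migrated; the -5 variant gives 2019-08, the new boundary of the view (so 2019-07 is now read from HDFS instead of Kudu); and +1 gives 2020-02, the partition added for upcoming data. You can evaluate the expressions directly in impala-shell:

SELECT from_unixtime(unix_timestamp(add_months(now(), -6)),'yyyy-MM') AS migrate_month,
       from_unixtime(unix_timestamp(add_months(now(), -5)),'yyyy-MM') AS view_boundary,
       from_unixtime(unix_timestamp(add_months(now(), +1)),'yyyy-MM') AS new_partition;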

9. Run the script

[root@bigserver2 tanktest]# impala-shell -i bigserver2:21000 -f tank_order.sql --var=view_name=tank_order_view --var=kudu_table=tank_order_kudu --var=hdfs_table=tank_order_hdfs
Starting Impala Shell without Kerberos authentication
Opened TCP connection to bigserver2:21000
Connected to bigserver2:21000
Server version: impalad version 3.2.0-cdh6.3.1 RELEASE (build 3d5de689e9b9cfde0c01e1c64f4b4e830cee69c3)
Query: use order_test
Query: INSERT INTO tank_order_hdfs PARTITION (month)
SELECT id,name,order_time,month
FROM tank_order_kudu
WHERE month = from_unixtime(unix_timestamp(add_months(now(), -6)),'yyyy-MM')
Query submitted at: 2020-01-14 17:49:22 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=44439e6c6e711ac4:0105b2f900000000
Modified 2 row(s) in 0.67s
Query: COMPUTE INCREMENTAL STATS tank_order_hdfs
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 7 partition(s) and 3 column(s). |
+-----------------------------------------+
Fetched 1 row(s) in 1.32s
Query: ALTER VIEW tank_order_view AS
SELECT id,month,name,order_time
FROM tank_order_kudu
WHERE month >= from_unixtime(unix_timestamp(add_months(now(), -5)),'yyyy-MM')
UNION ALL
SELECT id,month,name,order_time
FROM tank_order_hdfs
WHERE month < from_unixtime(unix_timestamp(add_months(now(), -5)),'yyyy-MM')
+------------------------+
| summary                |
+------------------------+
| View has been altered. |
+------------------------+
Fetched 1 row(s) in 0.07s
Query: ALTER TABLE tank_order_kudu
DROP RANGE PARTITION VALUE=from_unixtime(unix_timestamp(add_months(now(), -6)),'yyyy-MM')
+-----------------------------------+
| summary                           |
+-----------------------------------+
| Range partition has been dropped. |
+-----------------------------------+
Fetched 1 row(s) in 0.13s
Query: ALTER TABLE tank_order_kudu
ADD RANGE PARTITION VALUE=from_unixtime(unix_timestamp(add_months(now(), +1)),'yyyy-MM')
+---------------------------------+
| summary                         |
+---------------------------------+
| Range partition has been added. |
+---------------------------------+
Fetched 1 row(s) in 0.17s

10. Check the data after running the script

[bigserver2:21000] order_test> select * from tank_order_view order by id;
Query: select * from tank_order_view order by id
Query submitted at: 2020-01-14 17:50:50 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=0042af35d2363548:5e6046a400000000
+-----+---------+-----------+---------------------+
| id  | month   | name      | order_time          |
+-----+---------+-----------+---------------------+
| 11  | 2019-01 | test      | 2019-01-21 00:00:00 |
| 12  | 2019-02 | zhang     | 2019-02-12 00:00:00 |
| 13  | 2019-03 | tankzhang | 2019-03-10 00:00:00 |
| 14  | 2019-04 | hao       | 2019-04-15 00:00:00 |
| 15  | 2019-05 | tank1     | 2019-05-09 00:00:00 |
| 16  | 2019-06 | tank2     | 2019-06-23 00:00:00 |
| 100 | 2019-07 | tank      | 2019-07-01 00:00:00 |
| 101 | 2019-07 | test      | 2019-07-21 00:00:00 |
| 102 | 2019-08 | zhang     | 2019-08-12 00:00:00 |
| 103 | 2019-09 | tankzhang | 2019-09-10 00:00:00 |
| 104 | 2019-10 | hao       | 2019-10-15 00:00:00 |
| 105 | 2019-11 | tank1     | 2019-11-09 00:00:00 |
| 106 | 2019-12 | tank2     | 2019-12-23 00:00:00 |
| 107 | 2020-01 | tank3     | 2020-01-11 00:00:00 |
+-----+---------+-----------+---------------------+
Fetched 14 row(s) in 0.29s

[bigserver2:21000] order_test> select * from tank_order_kudu;
Query: select * from tank_order_kudu
Query submitted at: 2020-01-14 17:51:06 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=a14e07360ee03232:d8be8fe000000000
+-----+---------+-----------+---------------------+
| id  | month   | name      | order_time          |
+-----+---------+-----------+---------------------+
| 107 | 2020-01 | tank3     | 2020-01-11 00:00:00 |
| 104 | 2019-10 | hao       | 2019-10-15 00:00:00 |
| 103 | 2019-09 | tankzhang | 2019-09-10 00:00:00 |
| 102 | 2019-08 | zhang     | 2019-08-12 00:00:00 |
| 105 | 2019-11 | tank1     | 2019-11-09 00:00:00 |
| 106 | 2019-12 | tank2     | 2019-12-23 00:00:00 |
+-----+---------+-----------+---------------------+
Fetched 6 row(s) in 0.14s

[bigserver2:21000] order_test> select * from tank_order_hdfs;
Query: select * from tank_order_hdfs
Query submitted at: 2020-01-14 17:51:53 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=204219b3ecc74213:8bbc30bc00000000
+-----+-----------+---------------------+---------+
| id  | name      | order_time          | month   |
+-----+-----------+---------------------+---------+
| 12  | zhang     | 2019-02-12 00:00:00 | 2019-02 |
| 13  | tankzhang | 2019-03-10 00:00:00 | 2019-03 |
| 14  | hao       | 2019-04-15 00:00:00 | 2019-04 |
| 15  | tank1     | 2019-05-09 00:00:00 | 2019-05 |
| 16  | tank2     | 2019-06-23 00:00:00 | 2019-06 |
| 101 | test      | 2019-07-21 00:00:00 | 2019-07 |
| 100 | tank      | 2019-07-01 00:00:00 | 2019-07 |
| 11  | test      | 2019-01-21 00:00:00 | 2019-01 |
+-----+-----------+---------------------+---------+
Fetched 8 row(s) in 0.14s

As the output shows, the data visible through the view is unchanged, while part of the Kudu table's data has been migrated into the HDFS table.
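Since each order id should now live in exactly one of the two tables, a quick duplicate check on the view is the sketch below; with this test data set every id occurs only once, so it should return zero rows:

SELECT id, COUNT(*) AS cnt
FROM tank_order_view
GROUP BY id
HAVING COUNT(*) > 1;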

11. Check the partitions again

[bigserver2:21000] order_test> show range partitions tank_order_kudu;
Query: show range partitions tank_order_kudu
+-------------------+
| RANGE (month)     |
+-------------------+
| VALUE = "2019-08" |
| VALUE = "2019-09" |
| VALUE = "2019-10" |
| VALUE = "2019-11" |
| VALUE = "2019-12" |
| VALUE = "2020-01" |
| VALUE = "2020-02" |
+-------------------+
Fetched 7 row(s) in 0.05s

[bigserver2:21000] order_test> show partitions tank_order_hdfs;
Query: show partitions tank_order_hdfs
+---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+
| month   | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                             |
+---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+
| 2019-01 | 1     | 1      | 774B   | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-01 |
| 2019-02 | 1     | 1      | 779B   | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-02 |
| 2019-03 | 1     | 1      | 799B   | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-03 |
| 2019-04 | 1     | 1      | 769B   | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-04 |
| 2019-05 | 1     | 1      | 779B   | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-05 |
| 2019-06 | 1     | 1      | 779B   | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-06 |
| 2019-07 | 2     | 1      | 798B   | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-07 |
| Total   | 8     | 7      | 5.35KB | 0B           |                   |         |                   |                                                                                      |
+---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+

The Kudu table has dropped one old partition and gained one new one.

The HDFS table has gained one partition.

Running this script once a month keeps the order data rolling from Kudu into HDFS.



Please give attribution when reposting:
Author: 海底苍鹰
URL: http://blog.51yip.com/hadoop/2348.html