impala kudu 创建 表

张映 发表于 2020-01-07

分类目录: hadoop/spark/scala

标签:, , ,

Kudu和Impala均是Cloudera贡献给Apache基金会的顶级项目。cdh中impala和kudu做了深度整合,性能方面,还是不错的。

一,Kudu是什么?

Kudu是围绕Hadoop生态圈建立存储引擎,Kudu拥有和Hadoop生态圈共同的设计理念,它运行在普通的服务器上、可分布式规模化部署、并且满足工业界的高可用要求。其设计理念为fast analytics on fast data.。Kudu的大部分场景和Hbase类似,其设计降低了随机读写性能,提高了扫描性能,在大部分场景下,Kudu在拥有接近Hbase的随机读写性能的同时,还有远超Hbase的扫描性能。

区别于Hbase等存储引擎,Kudu有如下优势:

1,快速的OLAP类查询处理速度
2,与MapReduce、Spark等Hadoop生态圈常见系统高度兼容,其连接驱动由官方支持维护
3,与Impala深度集成,相比HDFS+Parquet+Impala的传统架构,Kudu+Impala在绝大多数场景下拥有更好的性能。
4,强大而灵活的一致性模型,允许用户对每个请求单独定义一致性模型,甚至包括强序列一致性。
5,能够同时支持OLTP和OLAP请求,并且拥有良好的性能。
6,Kudu集成在ClouderaManager之中,对运维友好。
7,高可用。采用Raft Consensus算法来作为master失败后选举模型,即使选举失败,数据仍然是可读的。
8,支持结构化的数据,纯粹的列式存储,省空间的同时,提供更高效的查询速度。

二,Impala是什么

Impala是建立在Hadoop生态圈的交互式SQL解析引擎,Impala的SQL语法与Hive高度兼容,并且提供标准的ODBC和JDBC接口。Impala本身不提供数据的存储服务,其底层数据可来自HDFS、Kudu、Hbase甚至亚马逊S3。

Impapa最早由Cloudera公司开发,于15年12月贡献给Apache基金会,目前其正式名字为Apache Impala(incubating)

Impala本身并不是Hive的完全替代品,对于一些大吞吐量长时间执行的请求,Hive仍然是最稳定最佳的选择,哪怕是SparkSQL,其稳定性也无法跟Hive媲美。

稳定性方面Impala不如Hive,但是在执行效率方面,Impala毫无疑问可以秒杀Hive。Impala采用内存计算模型,对于分布式Shuffle,可以尽可能的利用现代计算机的内存和CPU资源。同时,Impala也有预处理和分析技术,表数据插入之后可以用COMPUTE STATS指令来让Impala对行列数据深度分析。

Impala的优势

1,和Hive高度相似的SQL语法,无需太多学习成本
2,超大数据规模SQL解析的能力,高效利用内存与CPU利用,快速返回SQL查询结果。
3,集成多个底层数据源,HDFS、Kudu、Hbase等数据皆可通过Impala共享,并且无需进行数据同步。
4,与Hue深度集成,提供可视化的SQL操作以及work flow。
5,提供标准JDBC和ODBC接口,方便下游业务方无缝接入。
6,提供最多细化到列的权限管理,满足实际生产环境数据安全要求。

三,创建kudu表

1,执行impala-shell

[root@bigserver2 ~]# netstat -tpnl |grep 21000  //查看21000端口,启动了
tcp6 0 0 :::21000 :::* LISTEN 8307/impalad  

[root@bigserver2 ~]# impala-shell
Starting Impala Shell without Kerberos authentication
Opened TCP connection to bigserver2:21000
Connected to bigserver2:21000
Server version: impalad version 3.2.0-cdh6.3.1 RELEASE (build 3d5de689e9b9cfde0c01e1c64f4b4e830cee69c3)
***********************************************************************************
Welcome to the Impala shell.
(Impala Shell v3.2.0-cdh6.3.1 (3d5de68) built on Thu Sep 26 03:03:39 PDT 2019)

When you set a query option it lasts for the duration of the Impala shell session.
***********************************************************************************
[bigserver2:21000] default> show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
| parquet_test     |                                              |
| tanktest         |                                              |
+------------------+----------------------------------------------+
Fetched 4 row(s) in 0.62s

2,创建kudu内部表

[bigserver2:21000] parquet_test> CREATE TABLE tank_test_2
                               > (
                               >   id BIGINT,
                               >   name STRING,
                               >   time TIMESTAMP,
                               >   message STRING,
                               >   PRIMARY KEY(id)
                               > )
                               > PARTITION BY HASH PARTITIONS 16 STORED AS KUDU;
Query: CREATE TABLE tank_test_2
(
  id BIGINT,
  name STRING,
  time TIMESTAMP,
  message STRING,
  PRIMARY KEY(id)
)
PARTITION BY HASH PARTITIONS 16 STORED AS KUDU
+-------------------------+
| summary                 |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 1.08s

[bigserver2:21000] parquet_test> desc formatted tank_test_2;
Query: describe formatted tank_test_2
+------------------------------+----------------------------------------------------------------------+------------------------------------------------+
| name                         | type                                                                 | comment                                        |
+------------------------------+----------------------------------------------------------------------+------------------------------------------------+
| # col_name                   | data_type                                                            | comment                                        |
|                              | NULL                                                                 | NULL                                           |
| id                           | bigint                                                               | NULL                                           |
| name                         | string                                                               | NULL                                           |
| time                         | timestamp                                                            | NULL                                           |
| message                      | string                                                               | NULL                                           |
|                              | NULL                                                                 | NULL                                           |
| # Detailed Table Information | NULL                                                                 | NULL                                           |
| Database:                    | parquet_test                                                         | NULL                                           |
| OwnerType:                   | USER                                                                 | NULL                                           |
| Owner:                       | root                                                                 | NULL                                           |
| CreateTime:                  | Tue Jan 07 12:58:46 CST 2020                                         | NULL                                           |
| LastAccessTime:              | UNKNOWN                                                              | NULL                                           |
| Retention:                   | 0                                                                    | NULL                                           |
| Location:                    | hdfs://bigdata1/home/cdh6/hive/warehouse/parquet_test.db/tank_test_2 | NULL                                           |
| Table Type:                  | MANAGED_TABLE                                                        | NULL                                           |
| Table Parameters:            | NULL                                                                 | NULL                                           |
|                              | kudu.master_addresses                                                | bigserver1,bigserver2,bigserver3,bigserver4    |
|                              | kudu.table_name                                                      | impala::parquet_test.tank_test_2               |
|                              | storage_handler                                                      | org.apache.hadoop.hive.kudu.KuduStorageHandler |
|                              | transient_lastDdlTime                                                | 1578373126                                     |
|                              | NULL                                                                 | NULL                                           |
| # Storage Information        | NULL                                                                 | NULL                                           |
| SerDe Library:               | org.apache.hadoop.hive.kudu.KuduSerDe                                | NULL                                           |
| InputFormat:                 | org.apache.hadoop.hive.kudu.KuduInputFormat                          | NULL                                           |
| OutputFormat:                | org.apache.hadoop.hive.kudu.KuduOutputFormat                         | NULL                                           |
| Compressed:                  | No                                                                   | NULL                                           |
| Num Buckets:                 | 0                                                                    | NULL                                           |
| Bucket Columns:              | []                                                                   | NULL                                           |
| Sort Columns:                | []                                                                   | NULL                                           |
+------------------------------+----------------------------------------------------------------------+------------------------------------------------+

上面的kudu.table_name,创建外部表的时候会用的到。

3,创建外部表

[bigserver2:21000] parquet_test> CREATE external TABLE tank_test_1
                               > STORED AS KUDU
                               > TBLPROPERTIES (
                               >   'kudu.master_addresses' = 'bigserver2,bigserver3,bigserver5',
                               >   'kudu.table_name' = 'impala::parquet_test.tank_test_2'
                               > );
Query: CREATE external TABLE tank_test_1
STORED AS KUDU
TBLPROPERTIES (
  'kudu.master_addresses' = 'bigserver2,bigserver3,bigserver5',
  'kudu.table_name' = 'impala::parquet_test.tank_test_2'
)
+-------------------------+
| summary                 |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 0.38s

[bigserver2:21000] parquet_test> desc formatted tank_test_1;
Query: describe formatted tank_test_1
+------------------------------+----------------------------------------------------------------------+------------------------------------------------+
| name                         | type                                                                 | comment                                        |
+------------------------------+----------------------------------------------------------------------+------------------------------------------------+
| # col_name                   | data_type                                                            | comment                                        |
|                              | NULL                                                                 | NULL                                           |
| id                           | bigint                                                               | NULL                                           |
| name                         | string                                                               | NULL                                           |
| time                         | timestamp                                                            | NULL                                           |
| message                      | string                                                               | NULL                                           |
|                              | NULL                                                                 | NULL                                           |
| # Detailed Table Information | NULL                                                                 | NULL                                           |
| Database:                    | parquet_test                                                         | NULL                                           |
| OwnerType:                   | USER                                                                 | NULL                                           |
| Owner:                       | root                                                                 | NULL                                           |
| CreateTime:                  | Tue Jan 07 13:43:52 CST 2020                                         | NULL                                           |
| LastAccessTime:              | UNKNOWN                                                              | NULL                                           |
| Retention:                   | 0                                                                    | NULL                                           |
| Location:                    | hdfs://bigdata1/home/cdh6/hive/warehouse/parquet_test.db/tank_test_1 | NULL                                           |
| Table Type:                  | EXTERNAL_TABLE                                                       | NULL                                           |
| Table Parameters:            | NULL                                                                 | NULL                                           |
|                              | EXTERNAL                                                             | TRUE                                           |
|                              | kudu.master_addresses                                                | bigserver2,bigserver3,bigserver5               |
|                              | kudu.table_name                                                      | impala::parquet_test.tank_test_2               |
|                              | numFiles                                                             | 0                                              |
|                              | storage_handler                                                      | org.apache.hadoop.hive.kudu.KuduStorageHandler |
|                              | totalSize                                                            | 0                                              |
|                              | transient_lastDdlTime                                                | 1578375832                                     |
|                              | NULL                                                                 | NULL                                           |
| # Storage Information        | NULL                                                                 | NULL                                           |
| SerDe Library:               | org.apache.hadoop.hive.kudu.KuduSerDe                                | NULL                                           |
| InputFormat:                 | org.apache.hadoop.hive.kudu.KuduInputFormat                          | NULL                                           |
| OutputFormat:                | org.apache.hadoop.hive.kudu.KuduOutputFormat                         | NULL                                           |
| Compressed:                  | No                                                                   | NULL                                           |
| Num Buckets:                 | 0                                                                    | NULL                                           |
| Bucket Columns:              | []                                                                   | NULL                                           |
| Sort Columns:                | []                                                                   | NULL                                           |
+------------------------------+----------------------------------------------------------------------+------------------------------------------------+
Fetched 33 row(s) in 0.04s

内部表由Impala管理,当您从Impala中删除时,数据和表确实被删除。当您使用Impala创建新表时,它通常是内部表。

外部表(创建者CREATE EXTERNAL TABLE)不受Impala管理,并且删除此表不会将表从其源位置(此处为Kudu)丢弃。相反,它只会去除Impala和Kudu之间的映射。这是Kudu提供的用于将现有表映射到Impala的语法。

创建表时,经常遇到的问题

ERROR: AnalysisException: Columns cannot be specified with an external Kudu table.

原因:创建外部表,不能指定表字段

ERROR: AnalysisException: Not allowed to set 'kudu.table_name' manually for managed Kudu tables .

原因:创建内部表的时候,加了TBLPROPERTIE的kudu.table_name属性

ERROR: ImpalaRuntimeException: Error loading schema of table 'parquet_test.tank_test_2'

原因:创建外部表时,找不到kudu表,impala::parquet_test.tank_test_2 

4,测试内外部表

[bigserver2:21000] parquet_test> INSERT INTO tank_test_2 VALUES (1, "tank", "2020-01-07","test");
Query: INSERT INTO tank_test_2 VALUES (1, "tank", "2020-01-07","test")
Query submitted at: 2020-01-07 15:03:19 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=324fbb145fe19b36:b38bcd1100000000
Modified 1 row(s), 0 row error(s) in 0.89s

[bigserver2:21000] parquet_test> select * from tank_test_2;
Query: select * from tank_test_2
Query submitted at: 2020-01-07 15:03:39 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=da440d9f1f620951:7d43e3e600000000
+----+------+---------------------+---------+
| id | name | time                | message |
+----+------+---------------------+---------+
| 1  | tank | 2020-01-07 00:00:00 | test    |
+----+------+---------------------+---------+
Fetched 1 row(s) in 1.67s

[bigserver2:21000] parquet_test> select * from tank_test_1;
Query: select * from tank_test_1
Query submitted at: 2020-01-07 15:03:51 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=274231273a6d349d:b8d1dac200000000
+----+------+---------------------+---------+
| id | name | time                | message |
+----+------+---------------------+---------+
| 1  | tank | 2020-01-07 00:00:00 | test    |
+----+------+---------------------+---------+
Fetched 1 row(s) in 0.24s

[bigserver2:21000] parquet_test> UPSERT INTO tank_test_2 VALUES (1, "tank", "2020-01-07","tank");  //upsert语法,第一次见
Query: UPSERT INTO tank_test_2 VALUES (1, "tank", "2020-01-07","tank")
Query submitted at: 2020-01-07 15:05:58 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=1d459fc8db28e9a5:47ae51b100000000
Modified 1 row(s), 0 row error(s) in 0.12s

[bigserver2:21000] parquet_test> select * from tank_test_2;
Query: select * from tank_test_2
Query submitted at: 2020-01-07 15:06:01 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=e84a7ef56dd76146:ae0fccc500000000
+----+------+---------------------+---------+
| id | name | time                | message |
+----+------+---------------------+---------+
| 1  | tank | 2020-01-07 00:00:00 | tank    |
+----+------+---------------------+---------+
Fetched 1 row(s) in 0.18s

内部表数据的变动,外部表根着变动。可以把外部表理解成带增,删,改功能的视图。但是有一点不同的是,删除外部表,内部表不受影响。

四,表分区

根据主键列上的分区模式将表格划分为 tablets。每个 tablet 由至少一台 tablet server 提供。理想情况下,a table 应该分成 tablets ,分散在多个服务器上,以最大化并行运行。您使用的分区方案的详细信息将完全取决于您存储的数据类型和访问方式。

1,range分区

PARTITION BY RANGE
可以为一个或多个主键列指定范围分区。Kudu中的范围分区允许根据所选分区键的特定值或值的范围拆分表。这使您能够平衡写入中的并行性与扫描效率。

[bigserver2:21000] parquet_test> CREATE TABLE tank_test_3
                               > (
                               >   id BIGINT,
                               >   year INT,
                               >   name STRING,
                               >   PRIMARY KEY(id,year)
                               > ) PARTITION BY RANGE(year) (
                               >       PARTITION 2000 <= VALUES < 2005, >
                                      PARTITION 2005 <= VALUES < 2010, >
                                      PARTITION 2010 <= VALUES < 2015, >
                                      PARTITION 2015 <= VALUES < 2020 > )
                                      STORED AS KUDU;
Query: CREATE TABLE tank_test_3
(
  id BIGINT,
  year INT,
  name STRING,
  PRIMARY KEY(id,year)
) PARTITION BY RANGE(year) (
      PARTITION 2000 <= VALUES < 2005,
      PARTITION 2005 <= VALUES < 2010,
      PARTITION 2010 <= VALUES < 2015,
      PARTITION 2015 <= VALUES < 2020
) STORED AS KUDU
+-------------------------+
| summary                 |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 0.54s

主键从上到下,多个还要连续。
主键从上到下,多个还要连续。
主键从上到下,多个还要连续。

2,hash分区

PARTITION BY HASH
若不是通过明确的范围分发,或者与范围分布相结合,您可以通过哈希分发到特定数量的“buckets”。指定您要分区的主键列以及要使用的存储桶数。通过散列指定的键列来分配行。假设散列的值本身并不表现出显着的偏差,这将有助于将数据均匀地分布在数据桶之间。

您可以指定多个定义,您可以指定使用复合主键的定义。但是,在多个散列定义中不能提及一列。Consider two columns, a and b: *  HASH(a), HASH(b) * 、 HASH(a,b) *   后面两个是错误的表达:HASH(a), HASH(a,b)。如果主键值均匀分布在其域中,并且数据偏移不明显,例如时间戳或序列号,则哈希分区是一种合理的方法。

[bigserver2:21000] parquet_test> CREATE TABLE tank_test_4
                               > (
                               >   name STRING,
                               >   year INT,
                               >   PRIMARY KEY(name)
                               > ) PARTITION BY HASH(name) PARTITIONS 12
                               > STORED AS KUDU;
Query: CREATE TABLE tank_test_4
(
  name STRING,
  year INT,
  PRIMARY KEY(name)
) PARTITION BY HASH(name) PARTITIONS 12  //单个
STORED AS KUDU
+-------------------------+
| summary                 |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 0.69s

[bigserver2:21000] parquet_test> CREATE TABLE tank_test_5
                               > (
                               >   id BIGINT,
                               >   name STRING,
                               >   year INT,
                               >   PRIMARY KEY(id,name)
                               > ) PARTITION BY HASH(id,name) PARTITIONS 16
                               >  STORED AS KUDU;
Query: CREATE TABLE tank_test_5
(
  id BIGINT,
  name STRING,
  year INT,
  PRIMARY KEY(id,name)
) PARTITION BY HASH(id,name) PARTITIONS 16  //多个
 STORED AS KUDU
+-------------------------+
| summary                 |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 0.76s

3,hash和range混合分区

PARTITION BY HASH 和 RANGE
可以组合HASH和RANGE分区以创建更复杂的分区模式。可以指定零个或多个HASH定义,后跟零或一个RANGE定义。每个定义可以包含一个或多个列。

以下示例仍然会创建20个tablets,首先将id列分成5个存储区,然后根据year字符串的值将应用范围分区拆分为4个tablets。文字分散在至少四片(最多可达20张)。当您查询连续的year值范围时,很可能只需要从四分之一的电脑中读取即可完成查询。

[bigserver2:21000] parquet_test> CREATE TABLE tank_test_6
                               > (
                               >   id BIGINT,
                               >   year INT,
                               >   name STRING,
                               >   PRIMARY KEY(id,year)
                               > )
                               > PARTITION BY HASH(id) PARTITIONS 5,
                               > RANGE(year) (
                               >       PARTITION 2000 <= VALUES < 2005, >
                                       PARTITION 2005 <= VALUES < 2010, >
                                       PARTITION 2010 <= VALUES < 2015, >
                                       PARTITION 2015 <= VALUES < 2020 > )
                                       STORED AS KUDU;
Query: CREATE TABLE tank_test_6
(
  id BIGINT,
  year INT,
  name STRING,
  PRIMARY KEY(id,year)
)
PARTITION BY HASH(id) PARTITIONS 5,
RANGE(year) (
      PARTITION 2000 <= VALUES < 2005,
      PARTITION 2005 <= VALUES < 2010,
      PARTITION 2010 <= VALUES < 2015,
      PARTITION 2015 <= VALUES < 2020
) STORED AS KUDU
+-------------------------+
| summary                 |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 1.12s

[bigserver2:21000] parquet_test> show partitions tank_test_6;
Query: show partitions tank_test_6
+--------+------------------+------------------+-----------------+------------+
| # Rows | Start Key        | Stop Key         | Leader Replica  | # Replicas |
+--------+------------------+------------------+-----------------+------------+
| -1     | 00000000800007D0 | 00000000800007D5 | bigserver3:7050 | 3          |
| -1     | 00000000800007D5 | 00000000800007DA | bigserver5:7050 | 3          |
| -1     | 00000000800007DA | 00000000800007DF | bigserver2:7050 | 3          |
| -1     | 00000000800007DF | 00000000800007E4 | bigserver5:7050 | 3          |
| -1     | 00000001800007D0 | 00000001800007D5 | bigserver5:7050 | 3          |
| -1     | 00000001800007D5 | 00000001800007DA | bigserver5:7050 | 3          |
| -1     | 00000001800007DA | 00000001800007DF | bigserver3:7050 | 3          |
| -1     | 00000001800007DF | 00000001800007E4 | bigserver1:7050 | 3          |
| -1     | 00000002800007D0 | 00000002800007D5 | bigserver1:7050 | 3          |
| -1     | 00000002800007D5 | 00000002800007DA | bigserver2:7050 | 3          |
| -1     | 00000002800007DA | 00000002800007DF | bigserver1:7050 | 3          |
| -1     | 00000002800007DF | 00000002800007E4 | bigserver3:7050 | 3          |
| -1     | 00000003800007D0 | 00000003800007D5 | bigserver5:7050 | 3          |
| -1     | 00000003800007D5 | 00000003800007DA | bigserver5:7050 | 3          |
| -1     | 00000003800007DA | 00000003800007DF | bigserver3:7050 | 3          |
| -1     | 00000003800007DF | 00000003800007E4 | bigserver1:7050 | 3          |
| -1     | 00000004800007D0 | 00000004800007D5 | bigserver4:7050 | 3          |
| -1     | 00000004800007D5 | 00000004800007DA | bigserver4:7050 | 3          |
| -1     | 00000004800007DA | 00000004800007DF | bigserver4:7050 | 3          |
| -1     | 00000004800007DF | 00000004800007E4 | bigserver2:7050 | 3          |
+--------+------------------+------------------+-----------------+------------+
Fetched 20 row(s) in 0.20s

4,非覆盖范围分区

Kudu 1.0及更高版本支持使用非覆盖范围分区,其解决方案如下:

非覆盖的 range 范围分区,在需要考虑不断增加的主键的时间序列数据或其他模式的情况下,服务旧数据的tablets的大小相对固定,而接收新数据的tablet 将不受限制地增长。

如果要根据其类别(如销售区域或产品类型)对数据进行分区,而不使用未覆盖的范围分区,则必须提前知道所有分区,或者如果要添加或删除分区,请手动重新创建表。 例如引入或消除产品类型。

[bigserver2:21000] parquet_test> show partitions tank_test_3;
Query: show partitions tank_test_3
+--------+-----------+----------+-----------------+------------+
| # Rows | Start Key | Stop Key | Leader Replica  | # Replicas |
+--------+-----------+----------+-----------------+------------+
| -1     | 800007D0  | 800007D5 | bigserver4:7050 | 3          |
| -1     | 800007D5  | 800007DA | bigserver3:7050 | 3          |
| -1     | 800007DA  | 800007DF | bigserver2:7050 | 3          |
| -1     | 800007DF  | 800007E4 | bigserver5:7050 | 3          |
+--------+-----------+----------+-----------------+------------+
Fetched 4 row(s) in 4.37s

[bigserver2:21000] parquet_test> ALTER TABLE tank_test_3 ADD RANGE PARTITION 2020 <= VALUES < 2025;
Query: ALTER TABLE tank_test_3 ADD RANGE PARTITION 2020 <= VALUES < 2025
+---------------------------------+
| summary                         |
+---------------------------------+
| Range partition has been added. |
+---------------------------------+
Fetched 1 row(s) in 0.35s

[bigserver2:21000] parquet_test> show partitions tank_test_3;
Query: show partitions tank_test_3
+--------+-----------+----------+-----------------+------------+
| # Rows | Start Key | Stop Key | Leader Replica  | # Replicas |
+--------+-----------+----------+-----------------+------------+
| -1     | 800007D0  | 800007D5 | bigserver4:7050 | 3          |
| -1     | 800007D5  | 800007DA | bigserver3:7050 | 3          |
| -1     | 800007DA  | 800007DF | bigserver2:7050 | 3          |
| -1     | 800007DF  | 800007E4 | bigserver5:7050 | 3          |
| -1     | 800007E4  | 800007E9 | bigserver4:7050 | 3          |
+--------+-----------+----------+-----------------+------------+
Fetched 5 row(s) in 0.05s

五,Kudu以及Impala的不足

1,kudu不足

Kudu主键的限制

表创建后主键不可更改;
一行对应的主键内容不可以被Update操作修改。要修改一行的主键值,需要删除并新增一行新数据,并且该操作无法保持原子性;
主键的类型不支持DOUBLE、FLOAT、BOOL,并且主键必须是非空的(NOT NULL);
自动生成的主键是不支持的;
每行对应的主键存储单元(CELL)最大为16KB。

Kudu列的限制

MySQL中的部分数据类型,如DECIMAL, CHAR, VARCHAR, DATE, ARRAY等不支持;
数据类型以及是否可为空等列属性不支持修改;
一张表最多有300列。

Kudu表的限制

表的备份数必须为奇数,最大为7;
备份数在设置后不可修改。

Kudu单元(Cells)的限制

单元对应的数据最大为64KB,并且是在压缩前。

Kudu分片的限制

分片只支持手动指定,自动分片不支持;
分片设定不支持修改,修改分片设定需要”建新表-导数据-删老表”操作;
丢掉多数备份的Tablets需要手动修复。

Kudu容量限制

建议tablet servers的最大数量为100;
建议masters的最大数量为3;
建议每个tablet server存储的数据最大为4T(此处存疑,为何会有4T这么小的限制?);
每个tablet server存储的tablets数量建议在1000以内;
每个表分片后的tablets存储在单个tablet server的最大数量为60。

Kudu其他使用限制

Kudu被设计为分析的用途,每行对应的数据太大可能会碰到一些问题;
主键有索引,不支持二级索引(Secondary indexes);
多行的事务操作不支持;
关系型数据的一些功能,如外键,不支持;
列和表的名字强制为UTF-8编码,并且最大256字节;
删除一列并不会马上释放空间,需要执行Compaction操作,但是Compaction操作不支持手动执行;
删除表的操作会立刻释放空间。

2,impala不足

Impala的稳定性

Impala不适合超长时间的SQL请求;
Impala不支持高并发读写操作,即使Kudu是支持的;
Impala和Hive有部分语法不兼容。



转载请注明
作者:海底苍鹰
地址:http://blog.51yip.com/hadoop/2345.html/comment-page-1

1 条评论

  1. aa 留言

    20年写的文章,你竟然说kudu不支持DECIMAL类型?
    看看这个:https://kudu.apache.org/docs/schema_design.html#decimal