hbase filter 简介与实例

张映 发表于 2019-10-09

分类目录: hadoop/spark/scala

标签:, ,

HBase 不仅提供了这些简单的查询,而且提供了更加高级的过滤器(Filter)来查询。

过滤器可以根据列族、列、版本等更多的条件来对数据进行过滤,基于 HBase 本身提供的三维有序(行键,列,版本有序),这些过滤器可以高效地完成查询过滤的任务,带有过滤器条件的 RPC 查询请求会把过滤器分发到各个 RegionServer(这是一个服务端过滤器),这样也可以降低网络传输的压力。

a,查看自带filters

hbase(main):017:0> show_filters
DependentColumnFilter
KeyOnlyFilter
ColumnCountGetFilter
SingleColumnValueFilter
PrefixFilter
SingleColumnValueExcludeFilter
FirstKeyOnlyFilter
ColumnRangeFilter
TimestampsFilter
FamilyFilter
QualifierFilter
ColumnPrefixFilter
RowFilter
MultipleColumnPrefixFilter
InclusiveStopFilter
PageFilter
ValueFilter
ColumnPaginationFilter

b,准备测试数据

hbase(main):021:0> scan 'test_ns:user'
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000120190925 column=contact:tel, timestamp=1570591006670, value=02112345678
 1000120190925 column=info:age, timestamp=1570591004901, value=35
 1000120190925 column=info:sex, timestamp=1570591004880, value=Male
 1000120190925 column=login:password, timestamp=1570591004852, value=Pass1
 1000120190925 column=login:username, timestamp=1570591004809, value=Tank
 1000220190926 column=contact:mobile, timestamp=1569461617446, value=15812345678
 1000220190926 column=contact:tel, timestamp=1569461617469, value=02122345678
 1000220190926 column=info:age, timestamp=1569461617408, value=35
 1000220190926 column=login:password, timestamp=1569461617362, value=pass1
 1000220190926 column=login:username, timestamp=1569468833657, value=test1_111
 1000320190926 column=contact:mobile, timestamp=1569461617577, value=15822345678
 1000320190926 column=contact:tel, timestamp=1569461617594, value=02132345678
 1000320190926 column=info:age, timestamp=1569461617561, value=36
 1000320190926 column=info:sex, timestamp=1569461617538, value=male2
 1000320190926 column=login:password, timestamp=1569461617521, value=pass2
 1000320190926 column=login:username, timestamp=1569463065916, value=newtest2
 1000420190926 column=contact:mobile, timestamp=1569461617701, value=15833345678
 1000420190926 column=contact:tel, timestamp=1569461617717, value=02142345678
 1000420190926 column=info:age, timestamp=1569461617683, value=37
 1000420190926 column=info:sex, timestamp=1569461617666, value=male3
 1000420190926 column=login:password, timestamp=1569461617649, value=pass3
 1000420190926 column=login:username, timestamp=1569461617628, value=test3
 1000520190926 column=contact:mobile, timestamp=1569461709251, value=15803345678
 1000520190926 column=contact:tel, timestamp=1569461709916, value=02152345678
 1000520190926 column=info:age, timestamp=1569461709231, value=38
 1000520190926 column=info:sex, timestamp=1569461709213, value=none
 1000520190926 column=login:password, timestamp=1569461709200, value=admin
 1000520190926 column=login:username, timestamp=1569461709184, value=zhang
 1000620191009 column=info:sex, timestamp=1570592324271, value=
 1000620191009 column=login:password, timestamp=1570592323845, value=test
 1000620191009 column=login:username, timestamp=1570592323820, value=tank
6 row(s) in 0.0190 seconds

一,keyOnlyFilter返回的列值全部为空

1,语法

KeyOnlyFilter()

2,样例

hbase(main):023:0* import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
hbase(main):024:0* scan 'test_ns:user', {FILTER=>KeyOnlyFilter.new()}
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=
 1000120190925 column=contact:tel, timestamp=1570591006670, value=
 1000120190925 column=info:age, timestamp=1570591004901, value=
 1000120190925 column=info:sex, timestamp=1570591004880, value=
 1000120190925 column=login:password, timestamp=1570591004852, value=
 1000120190925 column=login:username, timestamp=1570591004809, value=
。。。。。。。。。。。。。。。。。。。。。。。。省略。。。。。。。。。。。。。。。。。。。。

在这里要注意哦,如果有10条数据,8条有值,2条没值。keyOnlyFilter返回10条空数据,而不是2条。

二,FirstKeyOnlyFilter返回每行的第一列

1,语法

FirstKeyOnlyFilter()

2,样例

hbase(main):025:0> import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
hbase(main):026:0* scan 'test_ns:user',{FILTER=>FirstKeyOnlyFilter.new()}
ROW COLUMN+CELL videv
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000220190926 column=contact:mobile, timestamp=1569461617446, value=15812345678
 1000320190926 column=contact:mobile, timestamp=1569461617577, value=15822345678
 1000420190926 column=contact:mobile, timestamp=1569461617701, value=15833345678
 1000520190926 column=contact:mobile, timestamp=1569461709251, value=15803345678
 1000620191009 column=info:sex, timestamp=1570592324271, value=

三,PrefixFilter根据rowkey的前缀过滤行

1,语法

PrefixFilter (‘<row_prefix>’)

2,样例,rowkey以10001开头

hbase(main):009:0> import org.apache.hadoop.hbase.filter.PrefixFilter;
hbase(main):010:0* import org.apache.hadoop.hbase.util.Bytes;
hbase(main):011:0* scan 'test_ns:user',{FILTER=>PrefixFilter.new(Bytes.toBytes('10001'))}
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000120190925 column=contact:tel, timestamp=1570591006670, value=02112345678
 1000120190925 column=info:age, timestamp=1570591004901, value=35
 1000120190925 column=info:sex, timestamp=1570591004880, value=Male
 1000120190925 column=login:password, timestamp=1570591004852, value=Pass1
 1000120190925 column=login:username, timestamp=1570591004809, value=Tank
1 row(s) in 0.2900 seconds

hbase(main):012:0> scan 'test_ns:user',{FILTER=>PrefixFilter.new(Bytes.toBytes('20191009'))}
ROW COLUMN+CELL
0 row(s) in 0.0130 seconds

hbase(main):013:0> scan 'test_ns:user',{FILTER=>PrefixFilter.new(Bytes.toBytes('1000620191009'))}
ROW COLUMN+CELL
 1000620191009 column=info:sex, timestamp=1570592324271, value=
 1000620191009 column=login:password, timestamp=1570592323845, value=test
 1000620191009 column=login:username, timestamp=1570592323820, value=tank
1 row(s) in 0.0110 seconds

注意哦:可模糊匹配

四,ColumnPrefixFilter返回满足条件的列

1,语法

ColumnPrefixFilter (‘<column_prefix>’)

2,样例,以u开头,或者sex开头的列

hbase(main):015:0> import org.apache.hadoop.hbase.filter.ColumnPrefixFilter
=> Java::OrgApacheHadoopHbaseFilter::ColumnPrefixFilter
hbase(main):016:0> import org.apache.hadoop.hbase.util.Bytes
hbase(main):017:0> scan 'test_ns:user',{STARTROW=>'1000120190925',STOPROW=>'1000620190926',FILTER=>ColumnPrefixFilter.new(Bytes.toBytes('u'))}
ROW COLUMN+CELL
 1000120190925 column=login:username, timestamp=1570591004809, value=Tank
 1000220190926 column=login:username, timestamp=1569468833657, value=test1_111
 1000320190926 column=login:username, timestamp=1569463065916, value=newtest2
 1000420190926 column=login:username, timestamp=1569461617628, value=test3
 1000520190926 column=login:username, timestamp=1569461709184, value=zhang 

hbase(main):027:0> scan 'test_ns:user',{STARTROW=>'1000120190925',STOPROW=>'1000620190926',FILTER=>"ColumnPrefixFilter('sex')"}
ROW COLUMN+CELL
 1000120190925 column=info:sex, timestamp=1570591004880, value=Male
 1000320190926 column=info:sex, timestamp=1569461617538, value=male2
 1000420190926 column=info:sex, timestamp=1569461617666, value=male3
 1000520190926 column=info:sex, timestamp=1569461709213, value=none

五,multipleColumnPrefixFilter根ColumnPrefixFilter差不多,只不过是多列

1,语法

MultipleColumnPrefixFilter (‘<column_prefix>’, ‘<column_prefix>’, …, ‘<column_prefix>’)

2,样例,取a,s开头的列

hbase(main):036:0> scan 'test_ns:user',{STARTROW=>'1000120190925',STOPROW=>'1000620190926',FILTER=>"MultipleColumnPrefixFilter('a','s')"}
ROW COLUMN+CELL
 1000120190925 column=info:age, timestamp=1570591004901, value=35
 1000120190925 column=info:sex, timestamp=1570591004880, value=Male
 1000220190926 column=info:age, timestamp=1569461617408, value=35
 1000320190926 column=info:age, timestamp=1569461617561, value=36
 1000320190926 column=info:sex, timestamp=1569461617538, value=male2
 1000420190926 column=info:age, timestamp=1569461617683, value=37
 1000420190926 column=info:sex, timestamp=1569461617666, value=male3
 1000520190926 column=info:age, timestamp=1569461709231, value=38
 1000520190926 column=info:sex, timestamp=1569461709213, value=none

六,ColumnCountGetFilter返回多少列

1,语法

ColumnCountGetFilter (‘<limit>’)

2,样例,取每行前二列

hbase(main):043:0> scan 'test_ns:user',{FILTER=>"ColumnCountGetFilter(2)"}
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000120190925 column=contact:tel, timestamp=1570591006670, value=02112345678
 1000220190926 column=contact:mobile, timestamp=1569461617446, value=15812345678
 1000220190926 column=contact:tel, timestamp=1569461617469, value=02122345678
 1000320190926 column=contact:mobile, timestamp=1569461617577, value=15822345678
 1000320190926 column=contact:tel, timestamp=1569461617594, value=02132345678
 1000420190926 column=contact:mobile, timestamp=1569461617701, value=15833345678
 1000420190926 column=contact:tel, timestamp=1569461617717, value=02142345678
 1000520190926 column=contact:mobile, timestamp=1569461709251, value=15803345678
 1000520190926 column=contact:tel, timestamp=1569461709916, value=02152345678
5 row(s) in 0.0100 seconds

七,PageFilter返回多少行

1,语法

PageFilter (‘<page_size>’)

2,样例,根据条件取一行

hbase(main):046:0> scan 'test_ns:user',{STARTROW=>'1000120190925',STOPROW=>'1000620190926',FILTER=>"PageFilter(1)"}
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000120190925 column=contact:tel, timestamp=1570591006670, value=02112345678
 1000120190925 column=info:age, timestamp=1570591004901, value=35
 1000120190925 column=info:sex, timestamp=1570591004880, value=Male
 1000120190925 column=login:password, timestamp=1570591004852, value=Pass1
 1000120190925 column=login:username, timestamp=1570591004809, value=Tank
1 row(s) in 0.0150 seconds

八,ColumnPaginationFilter根据limit和offset得到数据,

1,语法

ColumnPaginationFilter (‘<limit>’, ‘<offset>’)

2,样例,第二列开始,取一列

hbase(main):048:0> scan 'test_ns:user',{FILTER=>"ColumnPaginationFilter(1,2)"}
ROW COLUMN+CELL
 1000120190925 column=info:age, timestamp=1570591004901, value=35
 1000220190926 column=info:age, timestamp=1569461617408, value=35
 1000320190926 column=info:age, timestamp=1569461617561, value=36
 1000420190926 column=info:age, timestamp=1569461617683, value=37
 1000520190926 column=info:age, timestamp=1569461709231, value=38
 1000620191009 column=login:username, timestamp=1570592323820, value=tank
6 row(s) in 0.0110 seconds

九,InclusiveStopFilter设置停止的行

1,语法

InclusiveStopFilter (‘<stop_row_key>’)

2,样例,取数据截至到rowkey为1000220190926

hbase(main):051:0> scan 'test_ns:user',{FILTER=>"InclusiveStopFilter('1000220190926')"}
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000120190925 column=contact:tel, timestamp=1570591006670, value=02112345678
 1000120190925 column=info:age, timestamp=1570591004901, value=35
 1000120190925 column=info:sex, timestamp=1570591004880, value=Male
 1000120190925 column=login:password, timestamp=1570591004852, value=Pass1
 1000120190925 column=login:username, timestamp=1570591004809, value=Tank
 1000220190926 column=contact:mobile, timestamp=1569461617446, value=15812345678
 1000220190926 column=contact:tel, timestamp=1569461617469, value=02122345678
 1000220190926 column=info:age, timestamp=1569461617408, value=35
 1000220190926 column=login:password, timestamp=1569461617362, value=pass1
 1000220190926 column=login:username, timestamp=1569468833657, value=test1_111
2 row(s) in 0.0070 seconds

十,TimeStampsFilter取得指定时间戳的数据

1,语法

TimeStampsFilter (<timestamp>, <timestamp>, ... ,<timestamp>)

2,样例,取得时间戳为1569461617446,1570591004852的数据

hbase(main):052:0> scan 'test_ns:user',{FILTER=>"TimestampsFilter(1569461617446,1570591004852)"}
ROW COLUMN+CELL
 1000120190925 column=login:password, timestamp=1570591004852, value=Pass1
 1000220190926 column=contact:mobile, timestamp=1569461617446, value=15812345678
2 row(s) in 0.0200 seconds

十一,RowFilter根据rowkey的值过滤

1,语法

RowFilter (<compareOp>, ‘<row_comparator>’)

2,样例,rowkey小于1000220190926,并且取2列

hbase(main):067:0> scan 'test_ns:user',{FILTER=>"RowFilter(<=,'binary:1000220190926') AND ColumnCountGetFilter(2)"}
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000120190925 column=contact:tel, timestamp=1570591006670, value=02112345678
 1000220190926 column=contact:mobile, timestamp=1569461617446, value=15812345678
 1000220190926 column=contact:tel, timestamp=1569461617469, value=02122345678
2 row(s) in 0.0090 seconds

hbase(main):068:0> scan 'test_ns:user', FILTER=>"RowFilter(<=,'binary:1000220190926') AND ColumnCountGetFilter(2)"
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000120190925 column=contact:tel, timestamp=1570591006670, value=02112345678
 1000220190926 column=contact:mobile, timestamp=1569461617446, value=15812345678
 1000220190926 column=contact:tel, timestamp=1569461617469, value=02122345678
2 row(s) in 0.0090 seconds

十二,FamilyFilter根据列族过滤

1,语法

FamilyFilter (<compareOp>, ‘<family_comparator>’)

2,样例,列族为c开头,取一条数据

hbase(main):071:0> scan 'test_ns:user',{FILTER=>"FamilyFilter(=,'substring:c') AND PageFilter(1)"}
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000120190925 column=contact:tel, timestamp=1570591006670, value=02112345678
1 row(s) in 0.0070 seconds

注意:AND要大写,不然报错

十三,QualifierFilter根据列名过滤

1,语法

QualifierFilter (<compareOp>, ‘<qualifier_comparator>’)

2,样例,列匹配,取一条数据

hbase(main):075:0> scan 'test_ns:user',{FILTER=>"QualifierFilter(=,'regexstring:.obil.') AND PageFilter(2)"}
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000220190926 column=contact:mobile, timestamp=1569461617446, value=15812345678
2 row(s) in 0.0070 seconds

十四,ValueFilter根据值过滤,只返回匹配的列

1,语法

ValueFilter (<compareOp>, ‘<value_comparator>’)

2,样例,取得值为15854654215的列

hbase(main):077:0> scan 'test_ns:user',{FILTER=>"ValueFilter(=,'binary:15854654215')"}
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
1 row(s) in 0.0080 seconds

十五. SingleColumnValueFilter根据列值返回行

1,语法

SingleColumnValueFilter (‘<family>’, ‘<qualifier>’, <compare operator>, ‘<comparator>’, <filterIfColumnMissing_boolean>, <latest_version_boolean>)
SingleColumnValueFilter (‘<family>’, ‘<qualifier>’, <compare operator>, ‘<comparator>’)

2,样例,列族contact下的列mobile,大于15812345678的数据,取一行

hbase(main):081:0> scan 'test_ns:user',{FILTER=>"SingleColumnValueFilter('contact','mobile',>=,'binary:15812345678') AND PageFilter(1)"}
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000120190925 column=contact:tel, timestamp=1570591006670, value=02112345678
 1000120190925 column=info:age, timestamp=1570591004901, value=35
 1000120190925 column=info:sex, timestamp=1570591004880, value=Male
 1000120190925 column=login:password, timestamp=1570591004852, value=Pass1
 1000120190925 column=login:username, timestamp=1570591004809, value=Tank
1 row(s) in 0.0070 seconds

十六,SingleColumnValueExcludeFilter根据列值返回行,排队条件列

1,语法

SingleColumnValueExcludeFilter (<family>, <qualifier>, <compare operators>, <comparator>, <latest_version_boolean>, <filterIfColumnMissing_boolean>)
SingleColumnValueExcludeFilter (<family>, <qualifier>, <compare operator> <comparator>)

2,样例

//有mobile
hbase(main):107:0> scan 'test_ns:user',{FILTER=>"SingleColumnValueFilter('contact','mobile',>=,'binary:15833345678') AND PageFilter(1)"}
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000120190925 column=contact:tel, timestamp=1570591006670, value=02112345678
 1000120190925 column=info:age, timestamp=1570591004901, value=35
 1000120190925 column=info:sex, timestamp=1570591004880, value=Male
 1000120190925 column=login:password, timestamp=1570591004852, value=Pass1
 1000120190925 column=login:username, timestamp=1570591004809, value=Tank
1 row(s) in 0.0070 seconds

//没有mobile
hbase(main):108:0> scan 'test_ns:user',{FILTER=>"SingleColumnValueExcludeFilter('contact','mobile',>=,'binary:15833345678') AND PageFilter(1)"}
ROW COLUMN+CELL
 1000120190925 column=contact:tel, timestamp=1570591006670, value=02112345678
 1000120190925 column=info:age, timestamp=1570591004901, value=35
 1000120190925 column=info:sex, timestamp=1570591004880, value=Male
 1000120190925 column=login:password, timestamp=1570591004852, value=Pass1
 1000120190925 column=login:username, timestamp=1570591004809, value=Tank
1 row(s) in 0.0070 seconds

十七,ColumnRangeFilter,取得二列这间的所有列数据

1,语法

ColumnRangeFilter (‘<minColumn >’, <minColumnInclusive_bool>, ‘<maxColumn>’, <maxColumnInclusive_bool>)

2,样例

hbase(main):101:0> scan 'test_ns:user',{FILTER=>"ColumnRangeFilter('mobile',false,'tel',false) AND PageFilter(2)"}
ROW COLUMN+CELL
 1000120190925 column=info:sex, timestamp=1570591004880, value=Male
 1000120190925 column=login:password, timestamp=1570591004852, value=Pass1
 1000220190926 column=login:password, timestamp=1569461617362, value=pass1
2 row(s) in 0.0060 seconds

hbase(main):102:0> scan 'test_ns:user',{FILTER=>"ColumnRangeFilter('mobile',true,'tel',false) AND PageFilter(2)"}
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000120190925 column=info:sex, timestamp=1570591004880, value=Male
 1000120190925 column=login:password, timestamp=1570591004852, value=Pass1
 1000220190926 column=contact:mobile, timestamp=1569461617446, value=15812345678
 1000220190926 column=login:password, timestamp=1569461617362, value=pass1
2 row(s) in 0.0060 seconds

注意:true表示,包含本列,false表示,不包含本列

十八,DependentColumnFilter一种允许用户指定一个参考列或引用列来过滤其他列的过滤器,过滤的原则是基于参考列的时间戳来进行筛选

1,语法

DependentColumnFilter (‘<family>’, ‘<qualifier>’, <boolean>, <compare operator>, ‘<value comparator’)
DependentColumnFilter (‘<family>’, ‘<qualifier>’, <boolean>)
DependentColumnFilter (‘<family>’, ‘<qualifier>’)

2,样例

hbase(main):018:0> scan 'test_ns:user',{FILTER=>"DependentColumnFilter('contact','mobile',false,>=,'binary:15813345678')"}
ROW COLUMN+CELL
 1000120190925 column=contact:mobile, timestamp=1570591004923, value=15854654215
 1000320190926 column=contact:mobile, timestamp=1569461617577, value=15822345678
 1000420190926 column=contact:mobile, timestamp=1569461617701, value=15833345678
3 row(s) in 0.0130 seconds


转载请注明
作者:海底苍鹰
地址:http://blog.51yip.com/hadoop/2189.html