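Elasticsearch ships with analyzers that can handle Chinese text, but their output is crude, as the comparison at the end of this post shows. I recommend analysis-ik instead: an estimated 80%-90% of people who use ES for full-text search on Chinese text use this analyzer, and it is actively maintained.
1, Elasticsearch analyzers
Elasticsearch ships with many built-in analyzers, for example standard (the standard analyzer), english (for English text), and chinese (for Chinese text).
standard simply splits Chinese text one character at a time, so it applies broadly but is imprecise;
english is smarter about English text: it handles singular and plural forms, normalizes case, and filters stopwords (such as "the");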
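The difference between standard and english can be checked directly with the `_analyze` API. The following is a minimal sketch, assuming a node listening on 127.0.0.1:9200 (the address used later in this post); the helper name `analyze_body` and the sample sentence are made up for illustration.

```shell
# Hypothetical helper: build the JSON body for the _analyze API.
# $1 = analyzer name, $2 = text (must not contain double quotes or backslashes).
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}\n' "$1" "$2"
}

# Usage, assuming a node is listening on 127.0.0.1:9200:
#   for a in standard english; do
#     curl -s -XPOST "http://127.0.0.1:9200/_analyze?pretty" \
#       -H "Content-Type: application/json" \
#       -d "$(analyze_body "$a" "The Foxes are running")"
#   done
```

With standard, every word should come back lowercased; with english, stopwords such as "the" should be dropped and plural forms reduced to their stem.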
2, Install Maven
- $ brew install maven //mac
- # apt-get install maven //ubuntu
- # yum install maven //centos or redhat
- $ mvn -v
- Apache Maven 3.5.0 (ff8f5e7444045639af65f6095c62210b5713f426; 2017-04-04T03:39:06+08:00)
- Maven home: /usr/local/Cellar/maven/3.5.0/libexec
- Java version: 1.8.0_112, vendor: Oracle Corporation
- Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_112.jdk/Contents/Home/jre
- Default locale: zh_CN, platform encoding: UTF-8
- OS name: "mac os x", version: "10.12.6", arch: "x86_64", family: "mac"
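Before building, it may help to confirm that every tool used in the following steps is on the PATH. This is a small sketch; the helper name `missing_cmds` is made up here, and the tool list is simply the set of commands this post runs.

```shell
# Hypothetical helper: print each of the given commands that is not on PATH.
missing_cmds() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# The build steps in this post use git, mvn, java and unzip.
missing=$(missing_cmds git mvn java unzip)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "missing tools:" $missing
fi
```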
3, Download the analysis-ik plugin
- $ git clone https://github.com/medcl/elasticsearch-analysis-ik.git
- $ cd elasticsearch-analysis-ik
- $ git branch -a //use git checkout to switch to the branch that matches your ES version
- * master //master targets 6.2.3
- remotes/origin/2.x
- remotes/origin/5.3.x
- remotes/origin/5.x
- remotes/origin/6.1.x
- remotes/origin/HEAD -> origin/master
- remotes/origin/arkxu-master
- remotes/origin/master
- remotes/origin/revert-80-patch-1
- remotes/origin/rm
- remotes/origin/wyhw-ik_lucene4
- $ mvn package //build the package
- $ ll target/releases/
- total 4400
- drwxr-xr-x 3 zhangying staff 102 4 24 13:46 ./
- drwxr-xr-x 11 zhangying staff 374 4 24 13:32 ../
- -rw-r--r-- 1 zhangying staff 4501993 4 24 13:32 elasticsearch-analysis-ik-6.2.3.zip
- //a zip file is generated in the releases directory; unzip it
- $ cd target/releases/ && unzip elasticsearch-analysis-ik-6.2.3.zip
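Building from source is not the only route: the plugin's README also describes installing a prebuilt release with `elasticsearch-plugin`. The sketch below assumes that a release exists for your exact ES version and that its archive follows the naming pattern used on the project's GitHub releases page; `ik_release_url` is a name made up for this example.

```shell
# Hypothetical helper: print the release archive URL for a given ES version.
# Assumes the v<version>/elasticsearch-analysis-ik-<version>.zip naming pattern.
ik_release_url() {
  printf 'https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v%s/elasticsearch-analysis-ik-%s.zip\n' "$1" "$1"
}

# Usage (the version must match the running node exactly):
#   elasticsearch-plugin install "$(ik_release_url 6.2.3)"
```

If you install this way, the manual unzip and move steps are not needed, because `elasticsearch-plugin` places the files in the plugins directory itself.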
4, Install the analysis-ik plugin
- $ brew info elasticsearch
- elasticsearch: stable 6.2.3, HEAD
- Distributed search & analytics engine
- https://www.elastic.co/products/elasticsearch
- /usr/local/Cellar/elasticsearch/6.2.3 (112 files, 30.8MB) *
- Built from source on 2018-04-24 at 14:17:01
- From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/elasticsearch.rb
- ==> Requirements
- Required: java = 1.8 ✔
- ==> Options
- --HEAD
- Install HEAD version
- ==> Caveats
- Data: /usr/local/var/lib/elasticsearch/elasticsearch_zhangying/
- Logs: /usr/local/var/log/elasticsearch/elasticsearch_zhangying.log
- Plugins: /usr/local/var/elasticsearch/plugins/ //plugin path
- Config: /usr/local/etc/elasticsearch/
- To have launchd start elasticsearch now and restart at login:
- brew services start elasticsearch
- Or, if you don't want/need a background service you can just run:
- elasticsearch
- //move the directory you just unzipped into the plugins directory
- $ mv elasticsearch /usr/local/var/elasticsearch/plugins/ik
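Note: do not add index:analysis:analyzer: settings to elasticsearch.yml. Older versions supported this, but on ES 6.x none of the approaches I tried worked; the node reports the following error: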
node settings must not contain any index level settings
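Analysis settings are index-level, so one alternative is to set them when an index is created rather than in elasticsearch.yml. The following is a sketch under assumptions: the IK plugin is installed, `ik_max_word` is accepted as an analyzer `type`, and the index name `tank_default` and the helper `default_analyzer_settings` are hypothetical.

```shell
# Hypothetical helper: print index-creation settings that make an IK analyzer
# the default analyzer of the index. $1 = ik_max_word or ik_smart.
default_analyzer_settings() {
  printf '{"settings":{"analysis":{"analyzer":{"default":{"type":"%s"}}}}}\n' "$1"
}

# Usage, assuming a node on 127.0.0.1:9200 and an index that does not exist yet:
#   curl -XPUT "http://127.0.0.1:9200/tank_default?pretty" \
#     -H "Content-Type: application/json" \
#     -d "$(default_analyzer_settings ik_max_word)"
```

Setting the analyzer per field in the mapping, as done in the test section of this post, achieves the same thing with finer control.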
5, Start Elasticsearch
- $ elasticsearch //start
Startup succeeded if the log shows that the node started and that the analysis-ik plugin was loaded.
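Besides reading the startup log, the `_cat/plugins` API can confirm that the plugin is loaded. This is a minimal sketch, assuming a node on 127.0.0.1:9200; the helper name `has_ik_plugin` is made up, and it relies on the plugin being listed under the component name analysis-ik.

```shell
# Hypothetical helper: succeed if the _cat/plugins output read from stdin
# lists the IK plugin (component name analysis-ik).
has_ik_plugin() {
  grep -q 'analysis-ik'
}

# Usage, assuming a node on 127.0.0.1:9200:
#   curl -s "http://127.0.0.1:9200/_cat/plugins?v" | has_ik_plugin \
#     && echo "IK plugin loaded"
```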
6, Test Chinese word segmentation
- //create the index
- $ curl -XPUT "http://127.0.0.1:9200/tank?pretty"
- //create the mapping
- $ curl -XPOST "http://127.0.0.1:9200/tank/chinese/_mapping?pretty" -H "Content-Type: application/json" -d '
- {
- "chinese": {
- "_all":{
- "enabled":false //disable the _all field (full-text search across all fields)
- },
- "properties": {
- "id": {
- "type": "integer"
- },
- "username": {
- "type": "text",
- "analyzer": "ik_max_word" //fine-grained segmentation mode
- },
- "description": {
- "type": "text",
- "analyzer": "ik_max_word"
- }
- }
- }
- }
- '
- //insert two documents
- $ curl -XPOST "http://127.0.0.1:9200/tank/chinese/?pretty" -H "Content-Type: application/json" -d '
- {
- "id" : 1,
- "username" : "中国高铁速度很快",
- "description" : "如果要修改一个字段的类型"
- }'
- $ curl -XPOST "http://127.0.0.1:9200/tank/chinese/?pretty" -H "Content-Type: application/json" -d '
- {
- "id" : 2,
- "username" : "动车和复兴号,都属于高铁",
- "description" : "现在想要修改为string类型"
- }'
- //search
- $ curl -XPOST "http://127.0.0.1:9200/tank/chinese/_search?pretty" -H "Content-Type: application/json" -d '
- > {
- > "query": {
- > "match": {
- > "username": "中国高铁"
- > }
- > }
- > }
- > '
- {
- "took" : 188,
- "timed_out" : false,
- "_shards" : {
- "total" : 5,
- "successful" : 5,
- "skipped" : 0,
- "failed" : 0
- },
- "hits" : {
- "total" : 2,
- "max_score" : 0.8630463,
- "hits" : [
- {
- "_index" : "tank",
- "_type" : "chinese",
- "_id" : "oJfx_2IBVvjz0l6TkJ6K",
- "_score" : 0.8630463, //the higher the score, the better the match
- "_source" : {
- "id" : 1,
- "username" : "中国高铁速度很快",
- "description" : "如果要修改一个字段的类型"
- }
- },
- {
- "_index" : "tank",
- "_type" : "chinese",
- "_id" : "oZfx_2IBVvjz0l6Tpp64",
- "_score" : 0.5753642,
- "_source" : {
- "id" : 2,
- "username" : "动车和复兴号,都属于高铁",
- "description" : "现在想要修改为string类型"
- }
- }
- ]
- }
- }
7, Comparing the Elasticsearch built-in analyzer with the IK analyzers
- $ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
- > {
- > "analyzer":"ik_smart", //coarse-grained segmentation
- > "text":"感叹号"
- > }'
- {
- "tokens" : [
- {
- "token" : "感叹号",
- "start_offset" : 0,
- "end_offset" : 3,
- "type" : "CN_WORD",
- "position" : 0
- }
- ]
- }
- $ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
- > {
- > "analyzer":"standard", //built-in ES analyzer
- > "text":"感叹号"
- > }'
- {
- "tokens" : [
- {
- "token" : "感",
- "start_offset" : 0,
- "end_offset" : 1,
- "type" : "<IDEOGRAPHIC>",
- "position" : 0
- },
- {
- "token" : "叹",
- "start_offset" : 1,
- "end_offset" : 2,
- "type" : "<IDEOGRAPHIC>",
- "position" : 1
- },
- {
- "token" : "号",
- "start_offset" : 2,
- "end_offset" : 3,
- "type" : "<IDEOGRAPHIC>",
- "position" : 2
- }
- ]
- }
- $ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
- > {
- > "analyzer":"ik_max_word", //fine-grained segmentation
- > "text":"感叹号"
- > }'
- {
- "tokens" : [
- {
- "token" : "感叹号",
- "start_offset" : 0,
- "end_offset" : 3,
- "type" : "CN_WORD",
- "position" : 0
- },
- {
- "token" : "感叹",
- "start_offset" : 0,
- "end_offset" : 2,
- "type" : "CN_WORD",
- "position" : 1
- },
- {
- "token" : "叹号",
- "start_offset" : 1,
- "end_offset" : 3,
- "type" : "CN_WORD",
- "position" : 2
- }
- ]
- }
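The three responses above are easier to compare side by side when only the token values are printed. The sketch below assumes the pretty-printed response layout shown above (`"token" : "..."` on its own line); the helper name `tokens_of` is made up for this example.

```shell
# Hypothetical helper: print only the token values from a pretty-printed
# _analyze response read on stdin, one token per line.
tokens_of() {
  grep -o '"token" : "[^"]*"' | cut -d'"' -f4
}

# Usage, assuming a node on localhost:9200 and the tank index created earlier:
#   for a in standard ik_smart ik_max_word; do
#     echo "== $a"
#     curl -s -XPOST "http://localhost:9200/tank/_analyze?pretty" \
#       -H "Content-Type: application/json" \
#       -d "{\"analyzer\":\"$a\",\"text\":\"感叹号\"}" | tokens_of
#   done
```

For 感叹号 this should list three single characters for standard, one word for ik_smart, and three overlapping words for ik_max_word, matching the responses shown above.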
Please credit the source when reposting.
Author: 海底苍鹰
URL: http://blog.51yip.com/server/1892.html