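Elasticsearch ships with built-in Chinese tokenization, but it is very crude (a comparison follows later in this post). I recommend the analysis-ik plugin instead. By my estimate, 80%-90% of people who use ES for Chinese full-text search rely on this tokenizer, and it is still actively updated and maintained.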
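1, Elasticsearch analyzers
Elasticsearch has many built-in analyzers, for example standard (the standard analyzer), english (English analysis) and chinese (Chinese analysis).
standard simply splits text word by word (for Chinese, character by character), so it works on almost any text but with low precision;
english handles English text more intelligently: it recognizes singular and plural forms, normalizes case, and filters out stopwords (for example "the").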
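To see the difference between two analyzers directly, the `_analyze` API can be called without naming an index. The following is a minimal sketch, assuming a node is listening on http://127.0.0.1:9200 (the address used later in this post). The helper names `analyze_body` and `analyze` and the sample sentence are illustrative assumptions, not part of Elasticsearch.

```shell
# Assumption: Elasticsearch is reachable here; override ES_URL if it is not.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Build the JSON body for the _analyze API.
# Note: this simple helper does not escape quotes or backslashes in its arguments.
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Send the text through the named analyzer and print the raw response.
analyze() {
  curl -s -XPOST "$ES_URL/_analyze?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1" "$2")"
}

# Usage (needs a running node):
#   analyze standard "The foxes are jumping"
#   analyze english  "The foxes are jumping"
```

Running both calls on the same sentence and comparing the `token` values shows what each analyzer keeps, drops, or normalizes.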
2, Install Maven
$ brew install maven //mac
# apt-get install maven //ubuntu
# yum install maven //centos or redhat
$ mvn -v
Apache Maven 3.5.0 (ff8f5e7444045639af65f6095c62210b5713f426; 2017-04-04T03:39:06+08:00)
Maven home: /usr/local/Cellar/maven/3.5.0/libexec
Java version: 1.8.0_112, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_112.jdk/Contents/Home/jre
Default locale: zh_CN, platform encoding: UTF-8
OS name: "mac os x", version: "10.12.6", arch: "x86_64", family: "mac"
3, Download the analysis-ik plugin
$ git clone https://github.com/medcl/elasticsearch-analysis-ik.git
$ cd elasticsearch-analysis-ik
$ git branch -a //git checkout the branch that matches your ES version
* master //master was at 6.2.3 at the time of writing
remotes/origin/2.x
remotes/origin/5.3.x
remotes/origin/5.x
remotes/origin/6.1.x
remotes/origin/HEAD -> origin/master
remotes/origin/arkxu-master
remotes/origin/master
remotes/origin/revert-80-patch-1
remotes/origin/rm
remotes/origin/wyhw-ik_lucene4
$ mvn package //build the package
$ ll target/releases/
total 4400
drwxr-xr-x 3 zhangying staff 102 4 24 13:46 ./
drwxr-xr-x 11 zhangying staff 374 4 24 13:32 ../
-rw-r--r-- 1 zhangying staff 4501993 4 24 13:32 elasticsearch-analysis-ik-6.2.3.zip
//a zip file is generated in the releases directory; unzip it
$ cd target/releases/ && unzip elasticsearch-analysis-ik-6.2.3.zip
4, Install the analysis-ik plugin
$ brew info elasticsearch
elasticsearch: stable 6.2.3, HEAD
Distributed search & analytics engine
https://www.elastic.co/products/elasticsearch
/usr/local/Cellar/elasticsearch/6.2.3 (112 files, 30.8MB) *
Built from source on 2018-04-24 at 14:17:01
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/elasticsearch.rb
==> Requirements
Required: java = 1.8 ✔
==> Options
--HEAD
Install HEAD version
==> Caveats
Data: /usr/local/var/lib/elasticsearch/elasticsearch_zhangying/
Logs: /usr/local/var/log/elasticsearch/elasticsearch_zhangying.log
Plugins: /usr/local/var/elasticsearch/plugins/ //plugin directory
Config: /usr/local/etc/elasticsearch/
To have launchd start elasticsearch now and restart at login:
brew services start elasticsearch
Or, if you don't want/need a background service you can just run:
elasticsearch
//move the directory you just unzipped into the plugins directory
$ mv elasticsearch /usr/local/var/elasticsearch/plugins/ik
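Note: do not add index:analysis:analyzer: settings to elasticsearch.yml. Older versions supported this, but on ES 6.x I tried several approaches and none of them worked. The node fails with the following error: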
node settings must not contain any index level settings
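The error says that analyzer settings are index-level settings and no longer belong in the node configuration, so they have to be supplied per index. The following is a minimal sketch of that idea, assuming the ik plugin is installed and that `ik_max_word` can be used as the `type` of the index's `default` analyzer. The index name `tank_default` and the helper `default_analyzer_settings` are hypothetical names used only for illustration.

```shell
# Assumption: Elasticsearch is reachable here; override ES_URL if it is not.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Print index-level settings that make ik_max_word the default analyzer
# for a single index (instead of putting them in elasticsearch.yml).
default_analyzer_settings() {
  cat <<'EOF'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": { "type": "ik_max_word" }
      }
    }
  }
}
EOF
}

# Usage (needs a running node with the ik plugin; "tank_default" is hypothetical):
#   default_analyzer_settings | curl -s -XPUT "$ES_URL/tank_default?pretty" \
#     -H 'Content-Type: application/json' -d @-
```

Setting the analyzer explicitly on each text field in the mapping, as done in the test section of this post, achieves the same result field by field.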
5, Start Elasticsearch
$ elasticsearch //start
If the following output appears, Elasticsearch started successfully.
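Another way to confirm that the plugin was picked up is to ask the node for its plugin list through the `_cat/plugins` API. The following is a minimal sketch, assuming the node listens on http://127.0.0.1:9200 and that the plugin is listed under the name `analysis-ik`. The helper `has_plugin` is an illustrative name, not an Elasticsearch command.

```shell
# Assumption: Elasticsearch is reachable here; override ES_URL if it is not.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Succeeds (exit status 0) if the plugin list read from stdin
# contains the given plugin name.
has_plugin() {
  grep -q -- "$1"
}

# Usage (needs a running node):
#   curl -s "$ES_URL/_cat/plugins" | has_plugin analysis-ik && echo "ik plugin loaded"
```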
6, Test Chinese tokenization
//create the index
$ curl -XPUT "http://127.0.0.1:9200/tank?pretty"
//create the mapping
$ curl -XPOST "http://127.0.0.1:9200/tank/chinese/_mapping?pretty" -H "Content-Type: application/json" -d '
{
"chinese": {
"_all":{
"enabled":false //disable the _all field (full-text search across all fields)
},
"properties": {
"id": {
"type": "integer"
},
"username": {
"type": "text",
"analyzer": "ik_max_word" //finest-grained tokenization mode
},
"description": {
"type": "text",
"analyzer": "ik_max_word"
}
}
}
}
'
//insert two documents
$ curl -XPOST "http://127.0.0.1:9200/tank/chinese/?pretty" -H "Content-Type: application/json" -d '
{
"id" : 1,
"username" : "中国高铁速度很快",
"description" : "如果要修改一个字段的类型"
}'
$ curl -XPOST "http://127.0.0.1:9200/tank/chinese/?pretty" -H "Content-Type: application/json" -d '
{
"id" : 2,
"username" : "动车和复兴号,都属于高铁",
"description" : "现在想要修改为string类型"
}'
//search
$ curl -XPOST "http://127.0.0.1:9200/tank/chinese/_search?pretty" -H "Content-Type: application/json" -d '
> {
> "query": {
> "match": {
> "username": "中国高铁"
> }
> }
> }
> '
{
"took" : 188,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.8630463,
"hits" : [
{
"_index" : "tank",
"_type" : "chinese",
"_id" : "oJfx_2IBVvjz0l6TkJ6K",
"_score" : 0.8630463, //the higher the score, the better the match
"_source" : {
"id" : 1,
"username" : "中国高铁速度很快",
"description" : "如果要修改一个字段的类型"
}
},
{
"_index" : "tank",
"_type" : "chinese",
"_id" : "oZfx_2IBVvjz0l6Tpp64",
"_score" : 0.5753642,
"_source" : {
"id" : 2,
"username" : "动车和复兴号,都属于高铁",
"description" : "现在想要修改为string类型"
}
}
]
}
}
7, Comparing Elasticsearch's built-in Chinese tokenization with ik
$ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
> {
> "analyzer":"ik_smart", //coarsest-grained tokenization
> "text":"感叹号"
> }'
{
"tokens" : [
{
"token" : "感叹号",
"start_offset" : 0,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 0
}
]
}
$ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
> {
> "analyzer":"standard", //analyzer built into ES
> "text":"感叹号"
> }'
{
"tokens" : [
{
"token" : "感",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "叹",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "号",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
}
]
}
$ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
> {
> "analyzer":"ik_max_word", //finest-grained tokenization
> "text":"感叹号"
> }'
{
"tokens" : [
{
"token" : "感叹号",
"start_offset" : 0,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "感叹",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "叹号",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 2
}
]
}
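The three responses above are easier to compare side by side when only the tokens are printed. The following is a minimal sketch, assuming the node from this post is running with the `tank` index and the ik plugin, and that the response is pretty-printed in the same `"token" : "..."` form shown above. The helper names `tokens_only` and `compare_analyzers` are illustrative assumptions.

```shell
# Assumption: Elasticsearch is reachable here; override ES_URL if it is not.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Read a pretty-printed _analyze response on stdin, keep only the token
# values, and join them on one line separated by "|".
tokens_only() {
  sed -n 's/.*"token" : "\([^"]*\)".*/\1/p' | paste -sd'|' -
}

# Run the same text through each analyzer and print one line per analyzer.
# Note: the text is inserted into the JSON body without escaping.
compare_analyzers() {
  for analyzer in standard ik_smart ik_max_word; do
    printf '%s: ' "$analyzer"
    curl -s -XPOST "$ES_URL/tank/_analyze?pretty" \
      -H 'Content-Type: application/json' \
      -d "{\"analyzer\":\"$analyzer\",\"text\":\"$1\"}" | tokens_only
  done
}

# Usage (needs a running node with the ik plugin and the tank index):
#   compare_analyzers "感叹号"
```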
Please credit the source when reposting
Author: 海底苍鹰
URL: http://blog.51yip.com/server/1892.html
