elasticsearch ik Chinese word segmentation: installation and configuration

Posted by 张映 on 2018-04-26

Categories: elasticsearch, servers

Tags: elasticsearch, ik, Chinese word segmentation, full-text search

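elasticsearch can tokenize Chinese out of the box, but the result is very crude; a comparison follows later in this post. I recommend analysis ik instead. By my rough estimate, 80%-90% of the people who use es as a full-text search tool use this Chinese analyzer, and it is still actively updated and maintained.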

1. elasticsearch analyzers

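elasticsearch ships with many built-in analyzers, for example standard (the standard analyzer), english (English analysis) and chinese (Chinese analysis).

standard blindly splits Chinese text one character at a time, so it works on almost any input, but its precision is low;

english is smarter about English text: it handles singular and plural forms, normalizes case, and filters stopwords (for example the word "the").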

2. Install maven

$ brew install maven            # macOS
$ sudo apt-get install maven    # Ubuntu
$ sudo yum install maven        # CentOS or Red Hat

$ mvn -v
Apache Maven 3.5.0 (ff8f5e7444045639af65f6095c62210b5713f426; 2017-04-04T03:39:06+08:00)
Maven home: /usr/local/Cellar/maven/3.5.0/libexec
Java version: 1.8.0_112, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_112.jdk/Contents/Home/jre
Default locale: zh_CN, platform encoding: UTF-8
OS name: "mac os x", version: "10.12.6", arch: "x86_64", family: "mac"

3. Download the analysis ik plugin

$ git clone https://github.com/medcl/elasticsearch-analysis-ik.git
$ cd elasticsearch-analysis-ik
$ git branch -a    # git checkout the branch that matches your es version
* master           # at the time of writing, master targets 6.2.3
 remotes/origin/2.x
 remotes/origin/5.3.x
 remotes/origin/5.x
 remotes/origin/6.1.x
 remotes/origin/HEAD -> origin/master
 remotes/origin/arkxu-master
 remotes/origin/master
 remotes/origin/revert-80-patch-1
 remotes/origin/rm
 remotes/origin/wyhw-ik_lucene4
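
Choosing the branch by hand is easy to get wrong, so here is a minimal sketch of the selection as a shell function. The function name ik_branch_for is hypothetical, and the version-to-branch mapping is an assumption read off the branch names listed above plus the note that master targets 6.2.3; check the plugin's own documentation before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an elasticsearch version to one of the
# analysis ik branches shown in the `git branch -a` listing.
# Assumption: each N.x branch covers every release in that series.
ik_branch_for() {
  case "$1" in
    6.2.*) echo master ;;   # master was 6.2.3 when this post was written
    6.1.*) echo 6.1.x ;;
    5.3.*) echo 5.3.x ;;    # must come before the generic 5.* pattern
    5.*)   echo 5.x ;;
    2.*)   echo 2.x ;;
    *)     echo "no matching branch for $1" >&2; return 1 ;;
  esac
}

ik_branch_for 6.1.4   # prints 6.1.x
# Inside the cloned repository you could then run:
# git checkout "$(ik_branch_for 6.1.4)"
```

An unknown version makes the function return a non-zero status, so a script using it stops instead of building the wrong branch.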

$ mvn package  # build the plugin package

$ ll target/releases/
total 4400
drwxr-xr-x 3 zhangying staff 102 4 24 13:46 ./
drwxr-xr-x 11 zhangying staff 374 4 24 13:32 ../
-rw-r--r-- 1 zhangying staff 4501993 4 24 13:32 elasticsearch-analysis-ik-6.2.3.zip

# a zip file is generated in the releases directory; unzip it
$ cd target/releases/ && unzip elasticsearch-analysis-ik-6.2.3.zip

4. Install the analysis ik plugin

$ brew info elasticsearch
elasticsearch: stable 6.2.3, HEAD
Distributed search & analytics engine

https://www.elastic.co/products/elasticsearch

/usr/local/Cellar/elasticsearch/6.2.3 (112 files, 30.8MB) *
 Built from source on 2018-04-24 at 14:17:01
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/elasticsearch.rb
==> Requirements
Required: java = 1.8 ✔
==> Options
--HEAD
 Install HEAD version
==> Caveats
Data: /usr/local/var/lib/elasticsearch/elasticsearch_zhangying/
Logs: /usr/local/var/log/elasticsearch/elasticsearch_zhangying.log
Plugins: /usr/local/var/elasticsearch/plugins/  # plugin directory
Config: /usr/local/etc/elasticsearch/

To have launchd start elasticsearch now and restart at login:
 brew services start elasticsearch
Or, if you don't want/need a background service you can just run:
 elasticsearch

# move the directory that was just unzipped into the plugins directory
$ mv elasticsearch /usr/local/var/elasticsearch/plugins/ik

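Note: do not add index:analysis:analyzer: to the elasticsearch.yml file. Older versions supported it, but on es6.x I tried several approaches and none of them worked. Index-level settings such as analyzers have to be defined on the index itself, and putting them in the node configuration produces the following error: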

node settings must not contain any index level settings
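
If you do want ik as the default analyzer, the setting can go on the index when it is created. The following is a minimal sketch rather than a tested recipe: ik_index_body is a hypothetical helper, the index name tank_default is a placeholder, and it assumes the plugin exposes ik_max_word as an analyzer type, the same name the mapping later in this post uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a create-index request body that sets
# ik_max_word as the default analyzer for that one index.
# Assumption: "ik_max_word" is accepted as an analyzer type.
ik_index_body() {
  cat <<'JSON'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": { "type": "ik_max_word" }
      }
    }
  }
}
JSON
}

ik_index_body   # show the body that would be sent

# Against a running node (tank_default is a placeholder index name):
# curl -XPUT "http://127.0.0.1:9200/tank_default?pretty" \
#      -H "Content-Type: application/json" -d "$(ik_index_body)"
```

Because the setting lives in the index, elasticsearch.yml stays free of index-level keys and the node starts normally.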

5. Start elasticsearch

$ elasticsearch  # start

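If the startup output shows the analysis-ik plugin being loaded, the installation succeeded.

[Screenshot: analysis-ik Chinese word segmentation]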

6. Test Chinese word segmentation

# create the index
$ curl -XPUT "http://127.0.0.1:9200/tank?pretty" 

# create the mapping
# "_all" is disabled, so there is no full-text search across all fields
# "ik_max_word" is ik's fine-grained segmentation mode
$ curl -XPOST "http://127.0.0.1:9200/tank/chinese/_mapping?pretty" -H "Content-Type: application/json" -d '
{
    "chinese": {
        "_all": {
            "enabled": false
        },
        "properties": {
            "id": {
                "type": "integer"
            },
            "username": {
                "type": "text",
                "analyzer": "ik_max_word"
            },
            "description": {
                "type": "text",
                "analyzer": "ik_max_word"
            }
        }
    }
}
'
# insert two documents
$ curl -XPOST "http://127.0.0.1:9200/tank/chinese/?pretty"  -H "Content-Type: application/json" -d '
{
    "id" : 1,
    "username" :  "中国高铁速度很快",
    "description" :  "如果要修改一个字段的类型"
}'

$ curl -XPOST "http://127.0.0.1:9200/tank/chinese/?pretty"  -H "Content-Type: application/json" -d '
{
    "id" : 2,
    "username" :  "动车和复兴号，都属于高铁",
    "description" :  "现在想要修改为string类型"
}'

# search
$ curl -XPOST "http://127.0.0.1:9200/tank/chinese/_search?pretty"  -H "Content-Type: application/json"  -d '
> {
>     "query": {
>         "match": {
>             "username": "中国高铁"
>         }
>     }
> }
> '
{
  "took" : 188,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.8630463,
    "hits" : [
      {
        "_index" : "tank",
        "_type" : "chinese",
        "_id" : "oJfx_2IBVvjz0l6TkJ6K",
        "_score" : 0.8630463,  // the higher the score, the better the match
        "_source" : {
          "id" : 1,
          "username" : "中国高铁速度很快",
          "description" : "如果要修改一个字段的类型"
        }
      },
      {
        "_index" : "tank",
        "_type" : "chinese",
        "_id" : "oZfx_2IBVvjz0l6Tpp64",
        "_score" : 0.5753642,
        "_source" : {
          "id" : 2,
          "username" : "动车和复兴号，都属于高铁",
          "description" : "现在想要修改为string类型"
        }
      }
    ]
  }
}

7. Comparing elasticsearch's built-in Chinese segmentation with ik

# ik_smart: coarse-grained segmentation
$ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
> {
> "analyzer":"ik_smart",
> "text":"感叹号"
> }'
{
 "tokens" : [
 {
 "token" : "感叹号",
 "start_offset" : 0,
 "end_offset" : 3,
 "type" : "CN_WORD",
 "position" : 0
 }
 ]
}

# standard: the analyzer built into es
$ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
> {
> "analyzer":"standard",
> "text":"感叹号"
> }'
{
 "tokens" : [
 {
 "token" : "感",
 "start_offset" : 0,
 "end_offset" : 1,
 "type" : "<IDEOGRAPHIC>",
 "position" : 0
 },
 {
 "token" : "叹",
 "start_offset" : 1,
 "end_offset" : 2,
 "type" : "<IDEOGRAPHIC>",
 "position" : 1
 },
 {
 "token" : "号",
 "start_offset" : 2,
 "end_offset" : 3,
 "type" : "<IDEOGRAPHIC>",
 "position" : 2
 }
 ]
}

# ik_max_word: fine-grained segmentation
$ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
> {
> "analyzer":"ik_max_word",
> "text":"感叹号"
> }'
{
 "tokens" : [
 {
 "token" : "感叹号",
 "start_offset" : 0,
 "end_offset" : 3,
 "type" : "CN_WORD",
 "position" : 0
 },
 {
 "token" : "感叹",
 "start_offset" : 0,
 "end_offset" : 2,
 "type" : "CN_WORD",
 "position" : 1
 },
 {
 "token" : "叹号",
 "start_offset" : 1,
 "end_offset" : 3,
 "type" : "CN_WORD",
 "position" : 2
 }
 ]
}
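
The three responses are easier to compare once each is reduced to its list of tokens. Below is a minimal sketch that does this with grep and sed. The helper name tokens is hypothetical and is not part of elasticsearch or ik, and the sample is an abbreviated copy of the ik_max_word response above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read an _analyze response on stdin and print
# only the values of its "token" fields, one per line.
tokens() {
  grep -o '"token" *: *"[^"]*"' | sed 's/^"token" *: *"//; s/"$//'
}

# Abbreviated copy of the ik_max_word response shown above.
sample='{"tokens":[{"token":"感叹号","position":0},{"token":"感叹","position":1},{"token":"叹号","position":2}]}'
printf '%s\n' "$sample" | tokens
# prints 感叹号, 感叹 and 叹号, one per line

# Against a running node, pipe the curl output through the helper:
# curl -s -XPOST 'http://localhost:9200/tank/_analyze' \
#      -H 'Content-Type: application/json' \
#      -d '{"analyzer":"standard","text":"感叹号"}' | tokens
```

Running the same text through standard, ik_smart and ik_max_word this way makes the difference visible at a glance: single characters, one whole word, and the whole word plus its overlapping sub-words.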

Please credit the source when reposting
Author: 海底苍鹰
URL: http://blog.51yip.com/server/1892.html
