Elasticsearch: combining pinyin and Chinese word segmentation

Posted by 张映 on 2018-04-27

Categories: elasticsearch, servers

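The previous article covered the IK Chinese analyzer. The actual goal is for both pinyin and Chinese input to return search hits, similar to the input suggestions in the Baidu and Taobao search boxes.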

1, Install and configure analysis-pinyin

# Download
$ git clone https://github.com/medcl/elasticsearch-analysis-pinyin.git
$ cd elasticsearch-analysis-pinyin
$ git branch -a
* master // the master branch is at 6.2.3, which matches es 6.2.3
 remotes/origin/0.16.x
 remotes/origin/1.x
 remotes/origin/2.x
 remotes/origin/5.3.x
 remotes/origin/5.x
 remotes/origin/6.1.x
 remotes/origin/HEAD -> origin/master
 remotes/origin/master

$ mvn package  # build the plugin package

$ ll target/releases/
total 4400
drwxr-xr-x 3 zhangying staff 102 4 24 13:46 ./
drwxr-xr-x 11 zhangying staff 374 4 24 13:32 ../
-rw-r--r-- 1 zhangying staff 4501993 4 24 13:32 elasticsearch-analysis-pinyin-6.2.3.zip

$ cd target/releases/ && unzip elasticsearch-analysis-pinyin-6.2.3.zip

$ brew info elasticsearch
elasticsearch: stable 6.2.3, HEAD
Distributed search & analytics engine

https://www.elastic.co/products/elasticsearch

/usr/local/Cellar/elasticsearch/6.2.3 (112 files, 30.8MB) *
 Built from source on 2018-04-24 at 14:17:01
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/elasticsearch.rb
==> Requirements
Required: java = 1.8 ✔
==> Options
--HEAD
 Install HEAD version
==> Caveats
Data: /usr/local/var/lib/elasticsearch/elasticsearch_zhangying/
Logs: /usr/local/var/log/elasticsearch/elasticsearch_zhangying.log
Plugins: /usr/local/var/elasticsearch/plugins/   // plugin directory
Config: /usr/local/etc/elasticsearch/

To have launchd start elasticsearch now and restart at login:
 brew services start elasticsearch
Or, if you don't want/need a background service you can just run:
 elasticsearch

# Move the plugin built by mvn into the es plugin directory
$ mv elasticsearch /usr/local/var/elasticsearch/plugins/pinyin

$ elasticsearch  # start es
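
Elasticsearch generally refuses to load a plugin built for a different version, which is why the branch note in the git output matters. Below is a minimal sketch of a pre-install check that compares the version in the zip file name with the es version. The function names `plugin_version` and `check_versions` are made up for this example, and the es version is passed in by hand (here the 6.2.3 reported by brew info).

```shell
#!/bin/sh
# Hypothetical helper: extract the version from a plugin zip file name,
# e.g. elasticsearch-analysis-pinyin-6.2.3.zip -> 6.2.3
plugin_version() {
    basename "$1" .zip | sed 's/^elasticsearch-analysis-pinyin-//'
}

# Compare the plugin version ($1 is the zip path) with the es version ($2).
check_versions() {
    if [ "$(plugin_version "$1")" = "$2" ]; then
        echo "ok: plugin and es are both $2"
    else
        echo "mismatch: plugin $(plugin_version "$1"), es $2"
        return 1
    fi
}

check_versions target/releases/elasticsearch-analysis-pinyin-6.2.3.zip 6.2.3
```

Once the node is running, `curl 'http://localhost:9200/_cat/plugins?v'` should list the installed plugins, which is a quick way to confirm that the pinyin plugin was picked up.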

2, Test the pinyin analyzer

2.1, Test tokenization

$ curl -XPOST 'http://localhost:9200/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
> {
> "analyzer":"pinyin",
> "text":"gaotie"
> }'
{
 "tokens" : [
 {
 "token" : "gao",
 "start_offset" : 0,
 "end_offset" : 0,
 "type" : "word",
 "position" : 0
 },
 {
 "token" : "gaotie",
 "start_offset" : 0,
 "end_offset" : 0,
 "type" : "word",
 "position" : 0
 },
 {
 "token" : "tie",
 "start_offset" : 0,
 "end_offset" : 0,
 "type" : "word",
 "position" : 1
 }
 ]
}

$ curl -XPOST 'http://localhost:9200/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
> {
> "analyzer":"pinyin",
> "text":"高铁"
> }'
{
 "tokens" : [
 {
 "token" : "gao",
 "start_offset" : 0,
 "end_offset" : 0,
 "type" : "word",
 "position" : 0
 },
 {
 "token" : "gt",
 "start_offset" : 0,
 "end_offset" : 0,
 "type" : "word",
 "position" : 0
 },
 {
 "token" : "tie",
 "start_offset" : 0,
 "end_offset" : 0,
 "type" : "word",
 "position" : 1
 }
 ]
}

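As the output above shows, the pinyin analyzer can tokenize both pinyin and Chinese input, and the results differ: the Chinese input 高铁 produces the first-letter token gt, while the pinyin input gaotie produces gaotie in its place.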

2.2, Create the index and mapping, insert data

curl -XPUT "http://127.0.0.1:9200/pinyin?pretty"
curl -XPOST "http://127.0.0.1:9200/pinyin/test/_mapping?pretty" -H "Content-Type: application/json" -d '
{
    "test": {
            "_all":{
              "enabled":false
            },
            "properties": {
                "id": {
                    "type": "integer"
                },
                "username": {
                    "type": "text",
                    "analyzer": "pinyin"
                },
                "description": {
                    "type": "text",
                    "analyzer": "pinyin"
                }
            }
        }
  }
'
curl -XPOST "http://127.0.0.1:9200/pinyin/test/?pretty"  -H "Content-Type: application/json" -d '
{
    "id" : 1,
    "username" :  "中国高铁速度很快",
    "description" :  "如果要修改一个字段的类型"
}'

curl -XPOST "http://127.0.0.1:9200/pinyin/test/?pretty"   -H "Content-Type: application/json" -d '
{
    "id" : 2,
    "username" :  "动车和复兴号,都属于高铁",
    "description" :  "现在想要修改为string类型"
}'
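
As an alternative to the two separate requests above, the same documents could be sent in one call through the `_bulk` API. This is only a sketch: the file name `docs.ndjson` and the `ES_URL` environment variable are assumptions of this example, and the request is sent only when `ES_URL` is set (for instance to http://127.0.0.1:9200).

```shell
#!/bin/sh
# Write the two sample documents in the newline-delimited format used by
# the _bulk API: one action line, then one source line, per document.
# docs.ndjson is a file name chosen for this sketch.
cat > docs.ndjson <<'EOF'
{"index":{}}
{"id":1,"username":"中国高铁速度很快","description":"如果要修改一个字段的类型"}
{"index":{}}
{"id":2,"username":"动车和复兴号,都属于高铁","description":"现在想要修改为string类型"}
EOF

# Print the number of lines written (two per document).
wc -l < docs.ndjson

# ES_URL is an assumed variable; nothing is sent when it is unset.
if [ -n "${ES_URL:-}" ]; then
    curl -s -XPOST "$ES_URL/pinyin/test/_bulk?pretty" \
        -H 'Content-Type: application/x-ndjson' --data-binary @docs.ndjson
fi
```

`--data-binary` is used instead of `-d` because the bulk format depends on the line breaks being preserved. Running this in addition to the two single-document requests would index the documents twice.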

2.3, Full-pinyin search

$ curl -XPOST "http://127.0.0.1:9200/pinyin/test/_search?pretty"  -H "Content-Type: application/json"  -d '
> {
>     "query": {
>         "match": {
>             "username": "gao tie"
>         }
>     }
> }
> '
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.4039931,
    "hits" : [
      {
        "_index" : "pinyin",
        "_type" : "test",
        "_id" : "TGZ2AWMBlEkarXCPb7ED",
        "_score" : 0.4039931,
        "_source" : {
          "id" : 1,
          "username" : "中国高铁速度很快",
          "description" : "如果要修改一个字段的类型"
        }
      },
      {
        "_index" : "pinyin",
        "_type" : "test",
        "_id" : "TWZ2AWMBlEkarXCPb7En",
        "_score" : 0.35767543,
        "_source" : {
          "id" : 2,
          "username" : "动车和复兴号,都属于高铁",
          "description" : "现在想要修改为string类型"
        }
      }
    ]
  }
}

2.4, Pinyin analyzer, searching with Chinese characters

$ curl -XPOST "http://127.0.0.1:9200/pinyin/test/_search?pretty"  -H "Content-Type: application/json"  -d '
> {
>     "query": {
>         "match": {
>             "username": "中国高铁"
>         }
>     }
> }
> '
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.9398875,
    "hits" : [
      {
        "_index" : "pinyin",
        "_type" : "test",
        "_id" : "TGZ2AWMBlEkarXCPb7ED",
        "_score" : 1.9398875,
        "_source" : {
          "id" : 1,
          "username" : "中国高铁速度很快",
          "description" : "如果要修改一个字段的类型"
        }
      },
      {
        "_index" : "pinyin",
        "_type" : "test",
        "_id" : "TWZ2AWMBlEkarXCPb7En",
        "_score" : 0.35767543,
        "_source" : {
          "id" : 2,
          "username" : "动车和复兴号,都属于高铁",
          "description" : "现在想要修改为string类型"
        }
      }
    ]
  }
}

2.5, Partial initials

$ curl -XPOST "http://127.0.0.1:9200/pinyin/test/_search?pretty"  -H "Content-Type: application/json"  -d '
> {
>     "query": {
>         "match": {
>             "username": "Gaot"
>         }
>     }
> }
> '
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.20199655,
    "hits" : [
      {
        "_index" : "pinyin",
        "_type" : "test",
        "_id" : "TGZ2AWMBlEkarXCPb7ED",
        "_score" : 0.20199655,
        "_source" : {
          "id" : 1,
          "username" : "中国高铁速度很快",
          "description" : "如果要修改一个字段的类型"
        }
      },
      {
        "_index" : "pinyin",
        "_type" : "test",
        "_id" : "TWZ2AWMBlEkarXCPb7En",
        "_score" : 0.17883772,
        "_source" : {
          "id" : 2,
          "username" : "动车和复兴号,都属于高铁",
          "description" : "现在想要修改为string类型"
        }
      }
    ]
  }
}

# Same result as above
$ curl -XPOST "http://127.0.0.1:9200/pinyin/test/_search?pretty"  -H "Content-Type: application/json"  -d '
{
    "query": {
        "match": {
            "username": "gtie"
        }
    }
}
'

2.6, Search by full initials

$ curl -XPOST "http://127.0.0.1:9200/pinyin/test/_search?pretty"  -H "Content-Type: application/json"  -d '
> {
>     "query": {
>         "match": {
>             "username": "gt"
>         }
>     }
> }
> '
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

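Searching with the full initials of 高铁 (gt) returns nothing. The most likely reason is that the pinyin analyzer builds its first-letter token from the whole field value rather than from each word, so no standalone gt token ends up in the index.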
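
To see which tokens the pinyin analyzer actually produces for a whole username value (for example, whether a standalone gt is among them), the `_analyze` call from 2.1 can be repeated with the full sentence. A minimal sketch follows; `analyze_body` and the `ES_URL` environment variable are names invented for this example, and the request is sent only when `ES_URL` is set.

```shell
#!/bin/sh
# Hypothetical helper: build the JSON body for the _analyze API.
# No JSON escaping is done, so neither argument may contain
# double quotes or backslashes.
analyze_body() {
    printf '{"analyzer":"%s","text":"%s"}\n' "$1" "$2"
}

body=$(analyze_body pinyin '中国高铁速度很快')
echo "$body"

# ES_URL is an assumed variable, e.g. http://localhost:9200.
# The index pinyin exists at this point, so the index-scoped
# endpoint can be used.
if [ -n "${ES_URL:-}" ]; then
    curl -s -XPOST "$ES_URL/pinyin/_analyze?pretty=true" \
        -H 'Content-Type: application/json' -d "$body"
fi
```

Comparing the returned token list with the one for the query string gt shows whether the two share any token, which is what a match query needs in order to hit.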

3, Combining the pinyin and Chinese analyzers

3.1, Define a custom analyzer and configure the filter

$ curl -XPUT "http://localhost:9200/pinyin_ik/?pretty" -H "Content-Type: application/json" -d'
{
    "index": {
        "analysis": {
            "analyzer": {
                "ik_pinyin_analyzer": {
                    "type": "custom",
                    "tokenizer": "ik_max_word",
                    "filter": ["my_pinyin", "word_delimiter"]
                }
            },
            "filter": {
                "my_pinyin": {
                    "type": "pinyin"
                }
            }
        }
    }
}'
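
The `my_pinyin` filter above relies on the plugin defaults. As a sketch of what spelling the options out could look like, the block below writes a settings file with a few options that, to my knowledge, appear in the plugin README (`keep_first_letter`, `keep_full_pinyin`, `keep_original`, `limit_first_letter_length`, `lowercase`). The values are illustrative, not the author's configuration, and should be checked against the README of the plugin version in use. The file name `pinyin_ik_settings.json` is an assumption, and the body is nested under `settings`, the form shown in the Elasticsearch create-index documentation.

```shell
#!/bin/sh
# Write an index settings file with explicit pinyin filter options.
# Option values are illustrative; keep_original=true is a choice made
# for this sketch so the original Chinese tokens stay in the index.
cat > pinyin_ik_settings.json <<'EOF'
{
    "settings": {
        "analysis": {
            "analyzer": {
                "ik_pinyin_analyzer": {
                    "type": "custom",
                    "tokenizer": "ik_max_word",
                    "filter": ["my_pinyin", "word_delimiter"]
                }
            },
            "filter": {
                "my_pinyin": {
                    "type": "pinyin",
                    "keep_first_letter": true,
                    "keep_full_pinyin": true,
                    "keep_original": true,
                    "limit_first_letter_length": 16,
                    "lowercase": true
                }
            }
        }
    }
}
EOF

# Count the lines that set a keep_* option, as a quick sanity check.
grep -c '"keep_' pinyin_ik_settings.json
```

The file could then be passed to the create-index request with `-d @pinyin_ik_settings.json` in place of the inline body, using an index name that does not exist yet.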

$ curl -XPOST "http://127.0.0.1:9200/pinyin_ik/test/_mapping?pretty" -H "Content-Type: application/json" -d '
{
    "test": {
            "_all":{
              "enabled":false
            },
            "properties": {
                "id": {
                    "type": "integer"
                },
                "username": {
                    "type": "text",
                    "analyzer": "ik_pinyin_analyzer"
                },
                "description": {
                    "type": "text",
                    "analyzer": "ik_pinyin_analyzer"
                }
            }
        }
  }
'  

$ curl -XPOST "http://127.0.0.1:9200/pinyin_ik/test/?pretty"  -H "Content-Type: application/json" -d '
{
    "id" : 1,
    "username" :  "中国高铁速度很快",
    "description" :  "如果要修改一个字段的类型"
}'

$ curl -XPOST "http://127.0.0.1:9200/pinyin_ik/test/?pretty"   -H "Content-Type: application/json" -d '
{
    "id" : 2,
    "username" :  "动车和复兴号,都属于高铁",
    "description" :  "现在想要修改为string类型"
}'

3.2, Search by full initials

$ curl -XPOST "http://127.0.0.1:9200/pinyin_ik/test/_search?pretty"  -H "Content-Type: application/json"  -d '
> {
>     "query": {
>         "match": {
>             "username": "gt"
>         }
>     }
> }
> '
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.6935897,
    "hits" : [
      {
        "_index" : "pinyin_ik",
        "_type" : "test",
        "_id" : "S2ZzAWMBlEkarXCPu7Hq",
        "_score" : 0.6935897,
        "_source" : {
          "id" : 2,
          "username" : "动车和复兴号,都属于高铁",
          "description" : "现在想要修改为string类型"
        }
      },
      {
        "_index" : "pinyin_ik",
        "_type" : "test",
        "_id" : "SmZzAWMBlEkarXCPubHw",
        "_score" : 0.6827974,
        "_source" : {
          "id" : 1,
          "username" : "中国高铁速度很快",
          "description" : "如果要修改一个字段的类型"
        }
      }
    ]
  }
}
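
Only the gt query is shown for the new index. To check that the inputs from section 2 still behave as wanted with the combined analyzer, the same match query can be repeated for each of them. The sketch below does that; `match_body` and the `ES_URL` environment variable are names invented here, and the requests are sent only when `ES_URL` is set.

```shell
#!/bin/sh
# Hypothetical helper: build a match query on the username field.
# No JSON escaping is done, so the query text must not contain
# double quotes or backslashes.
match_body() {
    printf '{"query":{"match":{"username":"%s"}}}\n' "$1"
}

# Query strings taken from the tests in section 2.
for q in 'gao tie' '中国高铁' 'Gaot' 'gtie' 'gt'; do
    body=$(match_body "$q")
    echo "$q => $body"
    # ES_URL is an assumed variable, e.g. http://127.0.0.1:9200.
    if [ -n "${ES_URL:-}" ]; then
        curl -s -XPOST "$ES_URL/pinyin_ik/test/_search?pretty" \
            -H 'Content-Type: application/json' -d "$body"
    fi
done
```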


Please credit the source when reposting.
Author: 海底苍鹰
URL: http://blog.51yip.com/server/1894.html