Tokenizer vs. token filter

Updated: 2024-10-19 00:26:16
Problem Description


I'm trying to implement autocomplete using Elasticsearch thinking that I understand how to do it...

I'm trying to build multi-word (phrase) suggestions by using ES's edge_n_grams while indexing crawled data.

What is the difference between a tokenizer and a token_filter - I've read the docs on these but still need more understanding on them....

For instance is a token_filter what ES uses to search against user input? Is a tokenizer what ES uses to make tokens? What is a token?

Is it possible for ES to create multi-word suggestions using any of these things?

Solution

A tokenizer will split the whole input into tokens and a token filter will apply some transformation on each token.
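To make the two stages concrete, here is a minimal Python sketch of an analysis pipeline: a toy whitespace tokenizer followed by a lowercase token filter. The function names are illustrative only, not Elasticsearch APIs.

```python
def whitespace_tokenizer(text):
    # Stage 1: the tokenizer splits the whole input into tokens.
    return text.split()

def lowercase_filter(tokens):
    # Stage 2: a token filter applies a transformation to each token.
    return [t.lower() for t in tokens]

def analyze(text):
    # An analyzer chains one tokenizer with zero or more token filters.
    return lowercase_filter(whitespace_tokenizer(text))

print(analyze("The Quick Brown FOX"))  # ['the', 'quick', 'brown', 'fox']
```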

For instance, let's say the input is The quick brown fox. If you use an edgeNGram tokenizer, you'll get the following tokens:

  • T
  • Th
  • The
  • The (last character is a space)
  • The q
  • The qu
  • The qui
  • The quic
  • The quick
  • The quick (last character is a space)
  • The quick b
  • The quick br
  • The quick bro
  • The quick brow
  • The quick brown
  • The quick brown (last character is a space)
  • The quick brown f
  • The quick brown fo
  • The quick brown fox
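The prefix list above can be reproduced with a short sketch that mimics an edge n-gram tokenizer applied to the raw input, assuming `min_gram=1`, a `max_gram` at least as long as the input, and no `token_chars` restriction (so spaces are kept):

```python
def edge_ngram_tokenizer(text, min_gram=1, max_gram=None):
    # Treats the entire input as one stream and emits every prefix
    # between min_gram and max_gram characters long.
    if max_gram is None:
        max_gram = len(text)
    return [text[:i] for i in range(min_gram, min(max_gram, len(text)) + 1)]

tokens = edge_ngram_tokenizer("The quick brown fox")
print(tokens[:5])  # ['T', 'Th', 'The', 'The ', 'The q']
```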

However, if you use a standard tokenizer which will split the input into words/tokens, and then an edgeNGram token filter, you'll get the following tokens

  • T, Th, The
  • q, qu, qui, quic, quick
  • b, br, bro, brow, brown
  • f, fo, fox
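The second behaviour can be sketched the same way: split into words first (a standard-tokenizer-like step), then expand each word with an edge n-gram token filter (again assuming `min_gram=1`, `max_gram=5`):

```python
def edge_ngram_filter(tokens, min_gram=1, max_gram=5):
    # Expands each incoming token into its prefixes, up to max_gram chars.
    out = []
    for token in tokens:
        for i in range(min_gram, min(max_gram, len(token)) + 1):
            out.append(token[:i])
    return out

words = "The quick brown fox".split()  # standard-tokenizer-like split
print(edge_ngram_filter(words))
# ['T', 'Th', 'The', 'q', 'qu', 'qui', 'quic', 'quick',
#  'b', 'br', 'bro', 'brow', 'brown', 'f', 'fo', 'fox']
```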

As you can see, choosing between an edgeNGram tokenizer or token filter depends on how you want to slice and dice your text and how you want to search it.
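In index settings, that choice shows up as where the `edge_ngram` type appears. A sketch of both configurations, assuming recent Elasticsearch versions; the `my_*` names are placeholders and the `min_gram`/`max_gram` values are assumptions you should tune:

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_edge_tokenizer": { "type": "edge_ngram", "min_gram": 1, "max_gram": 20 }
      },
      "filter": {
        "my_edge_filter": { "type": "edge_ngram", "min_gram": 1, "max_gram": 5 }
      },
      "analyzer": {
        "prefix_whole_input": {
          "type": "custom",
          "tokenizer": "my_edge_tokenizer"
        },
        "prefix_per_word": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_edge_filter" ]
        }
      }
    }
  }
}
```

`prefix_whole_input` produces prefixes of the entire phrase (useful for phrase-level autocomplete), while `prefix_per_word` produces prefixes of each word independently.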

I suggest having a look at the excellent elyzer tool which provides a way to visualize the analysis process and see what is being produced during each step (tokenizing and token filtering).

As of ES 2.2, the _analyze endpoint also supports an explain feature which shows the details during each step of the analysis process.
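A hedged sketch of such a request (the exact parameter names may vary by version, so check the docs for yours):

```json
GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "The quick brown fox",
  "explain": true
}
```

With `"explain": true`, the response breaks the output down step by step: first the tokens emitted by the tokenizer, then the token stream after each filter in the chain.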

Published: 2023-11-30 04:24:31
Link: https://www.elefans.com/category/jswz/34/1648713.html