How does the ngram token filter differ from the edge ngram token filter?

Updated: 2024-10-25 14:33:05

Question

I am new to Elasticsearch and cannot tell the difference between the ngram token filter and the edge ngram token filter.

How do the two differ in the way they process tokens?

Recommended answer

I think the documentation for the edgeNGram tokenizer is pretty clear on this:

This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token.

And the best example for the nGram tokenizer again comes from the documentation:

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04

With this tokenizer definition:

"type" : "nGram", "min_gram" : "2", "max_gram" : "3", "token_chars": [ "letter", "digit" ]

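For context, here is a rough sketch of where such a fragment would sit inside index settings, written as a Python dict and serialized to JSON. The analyzer name my_ngram_analyzer comes from the curl call above; the tokenizer name my_ngram_tokenizer and the surrounding settings/analysis layout are assumptions for illustration, and newer Elasticsearch releases may expect the type spelled ngram rather than nGram.

```python
import json

# Tokenizer fragment as quoted above (values kept as strings, as in the source).
ngram_tokenizer = {
    "type": "nGram",
    "min_gram": "2",
    "max_gram": "3",
    "token_chars": ["letter", "digit"],
}

# Assumed wrapper: "my_ngram_tokenizer" is a hypothetical name chosen here;
# "my_ngram_analyzer" is the analyzer name used in the curl example.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_ngram_analyzer": {"tokenizer": "my_ngram_tokenizer"}
            },
            "tokenizer": {"my_ngram_tokenizer": ngram_tokenizer},
        }
    }
}

# JSON body that could be sent when creating the index.
body = json.dumps(index_settings, indent=2)
print(body)
```
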
In short:

The tokenizer first splits the input into tokens according to its configuration. In this example: FC, Schalke, 04.
nGram generates groups of characters of at least min_gram and at most max_gram characters from the input text. Basically, each token is split into small chunks and every chunk is anchored on a character; it doesn't matter where in the token that character is, every position starts its own chunks.
edgeNGram does the same, but the chunks always start at the beginning of each token. Basically, the chunks are anchored at the start of the token.

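For the same text as above, edgeNGram generates: FC, Sc, Sch, Scha, Schal, 04. Every "word" in the text is considered, and for every "word" the first character is the starting point (F from FC, S from Schalke and 0 from 04). Note that this output corresponds to a max_gram of 5; with max_gram set to 3 as in the definition above, Schalke would yield only Sc and Sch.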
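
To make the difference concrete, here is a small Python simulation of the two behaviours. It is only a sketch of the idea, not Elasticsearch's implementation: the helper names tokenize, ngrams and edge_ngrams are made up for this example, and the regular expression merely approximates token_chars = letter, digit.

```python
import re

def tokenize(text):
    # Approximation of token_chars ["letter", "digit"]:
    # keep runs of letters and digits, split on everything else.
    return re.findall(r"[^\W_]+", text)

def ngrams(token, min_gram, max_gram):
    # Every character position is an anchor; from each anchor emit
    # chunks of min_gram..max_gram characters that still fit in the token.
    out = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                out.append(token[start:start + size])
    return out

def edge_ngrams(token, min_gram, max_gram):
    # Only the first character is an anchor: emit prefixes of the token.
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]

text = "FC Schalke 04"
ngram_out = [g for t in tokenize(text) for g in ngrams(t, 2, 3)]
edge_out = [g for t in tokenize(text) for g in edge_ngrams(t, 2, 5)]

print(ngram_out)  # FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
print(edge_out)   # FC, Sc, Sch, Scha, Schal, 04
```

With min_gram 2 and max_gram 3 the first list matches the nGram output quoted from the documentation; the second list uses max_gram 5 to reproduce the edgeNGram output given in the answer.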