As I am new to Elasticsearch, I can't tell the difference between the ngram token filter and the edge ngram token filter.
How do these two differ from each other in processing tokens?
Recommended answer
I think the documentation is pretty clear on this:
"This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token."
And the best example for the nGram tokenizer again comes from the documentation:
    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
    # FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
with this tokenizer definition:
    "type" : "nGram",
    "min_gram" : "2",
    "max_gram" : "3",
    "token_chars": [ "letter", "digit" ]
In short:
- the tokenizer, depending on the configuration, will create tokens. In this example: FC, Schalke, 04.
- nGram generates groups of characters of minimum min_gram size and maximum max_gram size from an input text. Basically, the tokens are split into small chunks and each chunk is anchored on a character (it doesn't matter where this character is, all of them will create chunks).
- edgeNGram does the same but the chunks always start from the beginning of each token. Basically, the chunks are anchored at the beginning of the tokens.
For the same text as above, an edgeNGram generates this: FC, Sc, Sch, Scha, Schal, 04 (this output corresponds to min_gram 2 and max_gram 5, not the max_gram of 3 used above). Every "word" in the text is considered, and for every "word" the first character is the starting point (F from FC, S from Schalke and 0 from 04).
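To make the difference concrete, here is a rough Python sketch that imitates both behaviours on the example text. It is not Elasticsearch code: the function names (tokenize, ngrams, edge_ngrams, analyze) are made up for illustration, and the tokenizer is simplified to ASCII letters and digits, whereas the real "letter" and "digit" character classes are broader.

```python
import re


def tokenize(text):
    # Simplifying assumption: keep runs of ASCII letters/digits only,
    # loosely imitating token_chars: ["letter", "digit"].
    return re.findall(r"[A-Za-z0-9]+", text)


def ngrams(token, min_gram, max_gram):
    # Chunks may start at ANY character of the token.
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


def edge_ngrams(token, min_gram, max_gram):
    # Chunks always start at the FIRST character of the token.
    return [token[:size]
            for size in range(min_gram, max_gram + 1)
            if size <= len(token)]


def analyze(text, gram_fn, min_gram, max_gram):
    # Split the text into tokens, then expand each token into its grams.
    return [gram
            for token in tokenize(text)
            for gram in gram_fn(token, min_gram, max_gram)]


print(analyze("FC Schalke 04", ngrams, 2, 3))
print(analyze("FC Schalke 04", edge_ngrams, 2, 5))
```

The only difference between the two functions is where a chunk is allowed to start, which is the whole distinction described above.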