How to correctly handle multi-word synonym expansion with Elasticsearch?

Updated: 2024-10-04 23:31:40
Problem description

I have the following synonym expansion:

    suco => suco, refresco, bebida de soja

What I want is for the search to be tokenized this way:

A search for "suco de laranja" would be tokenized to ["suco", "laranja", "refresco", "bebida de soja"].

But I'm getting it tokenized to ["suco", "laranja", "refresco", "bebida", "soja"].

Consider that "de" is a stop word. I want it to be ignored in a query, so that "bebida de laranja" becomes ["bebida", "laranja"]. But I don't want it to be removed when the synonym is tokenized, so "bebida de soja" should stay as the single token "bebida de soja".

My settings:

    {
      "settings": {
        "analysis": {
          "filter": {
            "synonym_br": {
              "type": "synonym",
              "synonyms": [ "suco => suco, refresco, bebida de soja" ]
            },
            "brazilian_stop": {
              "type": "stop",
              "stopwords": "_brazilian_"
            }
          },
          "analyzer": {
            "synonyms": {
              "filter": [ "synonym_br", "lowercase", "brazilian_stop", "asciifolding" ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        }
      }
    }

Solution

I would suggest making the following two changes. The first relates directly to the question you asked; the second is a general recommendation.

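  • Instead of expanding one word into multiple synonyms, do the opposite: map all the synonyms to a single word. So change "suco => suco, refresco, bebida de soja" to "suco, refresco, bebida de soja => suco". Note that this only helps if the same analyzer is applied at both index time and search time, so that documents containing refresco or bebida de soja are also indexed as suco.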

  • Change the order of the filters in the synonyms analyzer: place lowercase before synonym_br. This ensures that letter case doesn't affect the synonym_br token filter.

So the final settings will be:

    {
      "settings": {
        "analysis": {
          "filter": {
            "synonym_br": {
              "type": "synonym",
              "synonyms": [ "suco, refresco, bebida de soja => suco" ]
            },
            "brazilian_stop": {
              "type": "stop",
              "stopwords": "_brazilian_"
            }
          },
          "analyzer": {
            "synonyms": {
              "filter": [ "lowercase", "synonym_br", "brazilian_stop", "asciifolding" ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        }
      }
    }
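One way to check what an analyzer really produces is the `_analyze` endpoint. The sketch below builds the request bodies in Python and sends them with the standard library. It assumes an unsecured cluster at http://localhost:9200 and uses a hypothetical index name, `synonym_test`; neither comes from the original question. The helper names (`build_settings`, `build_analyze_body`, `send`, `token_texts`) are illustrative only.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local cluster without authentication
INDEX = "synonym_test"              # hypothetical index name


def build_settings():
    """Index settings from the answer: lowercase runs before synonym_br."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "synonym_br": {
                        "type": "synonym",
                        "synonyms": ["suco, refresco, bebida de soja => suco"],
                    },
                    "brazilian_stop": {"type": "stop", "stopwords": "_brazilian_"},
                },
                "analyzer": {
                    "synonyms": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase",
                            "synonym_br",
                            "brazilian_stop",
                            "asciifolding",
                        ],
                    }
                },
            }
        }
    }


def build_analyze_body(text):
    """Body for POST /<index>/_analyze using the custom 'synonyms' analyzer."""
    return {"analyzer": "synonyms", "text": text}


def send(method, path, body):
    """Send a JSON request to the cluster and return the decoded JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def token_texts(analyze_response):
    """Extract just the token strings from an _analyze response."""
    return [t["token"] for t in analyze_response["tokens"]]


# Usage against a running cluster (not executed here):
#   send("PUT", "/" + INDEX, build_settings())
#   resp = send("POST", "/" + INDEX + "/_analyze", build_analyze_body("bebida de soja"))
#   print(token_texts(resp))
```

Running the same `_analyze` call against the original settings and the revised ones makes it easy to compare the two token streams side by side.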

How does this work?

For the input bebida de soja, the filters are applied in the following order:

    Filter            Result tokens
    ====================================
    lowercase         bebida, de, soja
    synonym_br        suco               <------- all the above tokens (including position) exactly match a synonym
    brazilian_stop    suco
    asciifolding      suco

Let's see brazilian_stop in action. For this we need an input that doesn't match the synonym but has de in it, e.g. de soja:

    Filter            Result tokens
    ====================================
    lowercase         de, soja
    synonym_br        de, soja           <------- none of the tokens (independently or combined, including position) matches any synonym
    brazilian_stop    soja               <------- de is removed as it is a stop word
    asciifolding      soja
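The two traces above can be reproduced with a small pure-Python model of the filter chain. This is a simplified sketch and not how Elasticsearch is implemented: whitespace splitting stands in for the standard tokenizer, `STOPWORDS` is a tiny illustrative subset of the `_brazilian_` list, and the accent folding only approximates asciifolding. All function names simply mirror the filter names from the settings.

```python
import unicodedata

# Contraction rule from the answer: every phrase on the left maps to "suco".
SYNONYMS = {
    ("suco",): ["suco"],
    ("refresco",): ["suco"],
    ("bebida", "de", "soja"): ["suco"],
}

# Assumption: a tiny illustrative subset of the _brazilian_ stop word list.
STOPWORDS = {"de", "a", "o", "e"}


def lowercase(tokens):
    return [t.lower() for t in tokens]


def synonym_br(tokens):
    """Replace the longest matching phrase at each position with its target."""
    out, i = [], 0
    longest = max(len(phrase) for phrase in SYNONYMS)
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in SYNONYMS:
                out.extend(SYNONYMS[phrase])
                i += n
                break
        else:
            # No rule matched at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def brazilian_stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def asciifolding(tokens):
    """Drop combining marks after NFKD decomposition (rough approximation)."""
    return [
        "".join(c for c in unicodedata.normalize("NFKD", t)
                if not unicodedata.combining(c))
        for t in tokens
    ]


def analyze(text):
    tokens = text.split()  # stand-in for the standard tokenizer
    for token_filter in (lowercase, synonym_br, brazilian_stop, asciifolding):
        tokens = token_filter(tokens)
    return tokens


print(analyze("bebida de soja"))   # the whole phrase matches the synonym rule
print(analyze("de soja"))          # no rule matches; the stop filter drops "de"
print(analyze("Suco de Laranja"))  # lowercase runs first, so the case is irrelevant
```

Because synonym_br runs before brazilian_stop in this chain, the multi-word phrase is matched while de is still present in the token stream. That ordering is what lets the rule "bebida de soja" match at all.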

Published: 2023-11-28 03:26:05
Source: https://www.elefans.com/category/jswz/34/1640772.html