Why does Lucene use maxDoc rather than numDocs to compute a term's idf?

Updated: 2024-10-27 10:18:35
Problem description

I found this in the javadoc of the public float idf(Term term, Searcher searcher) method of Lucene's Similarity class:

Note that Searcher.maxDoc() is used instead of IndexReader#numDocs() because also Searcher.docFreq(Term) is used, and when the latter is inaccurate, so is Searcher.maxDoc(), and in the same direction. In addition, Searcher.maxDoc() is more efficient to compute.

This does not quite make sense to me. Does this have something to do with Document deletion in an IndexReader?

Recommended answer

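Yes, exactly right. Whenever a document is deleted (or updated, since an update in Lucene is just a delete followed by an add), the document remains in the index until the segments that hold it are merged, often by an index optimize. Having been deleted, it won't be returned by searches, but its terms will still influence idf scoring: Searcher.docFreq(Term) keeps counting it until the merge, and so does Searcher.maxDoc(). Pairing docFreq with maxDoc keeps both counts on the same basis, whereas numDocs() excludes deleted documents and would be mismatched against docFreq.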
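
To make the "inaccurate in the same direction" point concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class IdfDeletionSketch and all of its methods are hypothetical stand-ins that only mimic how deletions are marked rather than physically removed. The idf formula is assumed to be the one used by classic DefaultSimilarity, log(numDocs / (docFreq + 1)) + 1, so check it against your Lucene version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of an index with unmerged deletions. Hypothetical; not the Lucene API. */
public class IdfDeletionSketch {
    private final List<Set<String>> docs = new ArrayList<>(); // terms per document slot
    private final List<Boolean> deleted = new ArrayList<>();  // deletion marks per slot

    public void add(String... terms) {
        docs.add(new HashSet<>(Arrays.asList(terms)));
        deleted.add(false);
    }

    /** Marks a document as deleted; its slot and terms stay until a merge. */
    public void delete(int docId) {
        deleted.set(docId, true);
    }

    /** Counts every slot, including deleted ones. */
    public int maxDoc() {
        return docs.size();
    }

    /** Counts only live (non-deleted) documents. */
    public int numDocs() {
        int live = 0;
        for (boolean d : deleted) {
            if (!d) live++;
        }
        return live;
    }

    /** Counts documents containing the term, deleted ones included (stale until merge). */
    public int docFreq(String term) {
        int n = 0;
        for (Set<String> doc : docs) {
            if (doc.contains(term)) n++;
        }
        return n;
    }

    /** Assumed classic idf formula: log(docCount / (docFreq + 1)) + 1. */
    public static float idf(int docFreq, int docCount) {
        return (float) (Math.log(docCount / (double) (docFreq + 1)) + 1.0);
    }

    /** Five documents, four containing "lucene"; three of those four are then deleted. */
    public static IdfDeletionSketch demoIndex() {
        IdfDeletionSketch idx = new IdfDeletionSketch();
        for (int i = 0; i < 4; i++) {
            idx.add("lucene", "doc" + i);
        }
        idx.add("other");
        idx.delete(0);
        idx.delete(1);
        idx.delete(2);
        return idx;
    }

    public static void main(String[] args) {
        IdfDeletionSketch idx = demoIndex();
        int df = idx.docFreq("lucene");
        System.out.println("docFreq=" + df + " maxDoc=" + idx.maxDoc() + " numDocs=" + idx.numDocs());
        System.out.println("idf with maxDoc:  " + idf(df, idx.maxDoc()));
        System.out.println("idf with numDocs: " + idf(df, idx.numDocs()));
    }
}
```

In this toy index docFreq is 4 while numDocs is only 2, so the numDocs variant would claim the term occurs in more documents than exist and the idf collapses. With maxDoc (5), both numbers still include the deleted documents, so the ratio stays self-consistent until a merge brings both back down.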

The Lucene FAQ has some information related to this, particularly in the last paragraph of its answer on deletion and in its entry that addresses maxDoc specifically.

Published: 2023-10-24 06:03:13
Article link: https://www.elefans.com/category/jswz/34/1523124.html