Why does Lucene use maxDoc rather than numDocs to compute a term's idf?

Updated: 2024-10-27 10:18:35

Question

I found this in the javadoc for the public float idf(Term term, Searcher searcher) method of Lucene's Similarity class:

Note that Searcher.maxDoc() is used instead of IndexReader#numDocs() because also Searcher.docFreq(Term) is used, and when the latter is inaccurate, so is Searcher.maxDoc(), and in the same direction. In addition, Searcher.maxDoc() is more efficient to compute.

This does not quite make sense to me. Does it have something to do with document deletion in an IndexReader?

Recommended answer

Yes, exactly right. Whenever a document is deleted (or updated, since an update in Lucene is just a delete followed by an add), the document remains in the index, marked as deleted, until its segments are merged, often by an index optimize. Having been deleted, it won't be returned by searches, but its terms will still influence idf scoring. Because docFreq still counts those deleted documents, pairing it with maxDoc, which counts them too, keeps the ratio consistent.
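To make the "inaccurate in the same direction" argument concrete, here is a small self-contained sketch in plain Java. It does not call Lucene. The idf helper mirrors the classic DefaultSimilarity formula, 1 + ln(docCount / (docFreq + 1)), which is an assumption about the Lucene version under discussion; the class name and the index numbers (1000 documents, 500 deleted, 100 postings for the term of which 50 belong to deleted documents) are made up for illustration.

```java
// Illustrative sketch only: no Lucene dependency, hypothetical numbers.
class IdfDeletionDemo {

    // Assumed to mirror the classic DefaultSimilarity idf:
    // 1 + ln(docCount / (docFreq + 1)).
    static double idf(long docFreq, long docCount) {
        return 1.0 + Math.log(docCount / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long maxDoc = 1000;      // all document slots, deleted ones included
        long numDocs = 500;      // live documents only
        long staleDocFreq = 100; // docFreq before a merge still counts deleted docs
        long liveDocFreq = 50;   // docFreq the term would have after a merge

        // What scoring would use once the deletions are merged away.
        double truth = idf(liveDocFreq, numDocs);
        // Both inputs overcount deleted documents, so the errors largely cancel.
        double consistent = idf(staleDocFreq, maxDoc);
        // Stale docFreq paired with the exact live count skews the ratio.
        double mixed = idf(staleDocFreq, numDocs);

        System.out.printf("after merge:        %.4f%n", truth);
        System.out.printf("docFreq + maxDoc:   %.4f%n", consistent);
        System.out.printf("docFreq + numDocs:  %.4f%n", mixed);
    }
}
```

With these numbers the maxDoc pairing lands close to the post-merge value, while the numDocs pairing understates idf noticeably. The cancellation is only approximate and depends on deletions being spread roughly evenly across terms.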

The Lucene FAQ has some related information, particularly the last paragraph of its answer on deletion and its entry addressing maxDoc specifically.