Getting the same results using different similarities in Lucene


We're using Lucene in Java to search for documents and find out if they're relevant or not. We're searching in 6 different ways:

1. VSM Similarity with Porter stemmer and stop words
2. VSM Similarity with Porter stemmer and no stop words
3. VSM Similarity with standard stemmer and stop words
4. BM25 Similarity with Porter stemmer and stop words
5. BM25 Similarity with Porter stemmer and no stop words
6. BM25 Similarity with standard stemmer and stop words

Results from search configurations 3 and 6 are the same, and the results from configurations 1, 2, 4 and 5 are also the same. This suggests that only changing the analyzer (the stemmer) has any effect.

We've tried debugging to check whether the objects are what we expect them to be, but everything seems to be in order; the objects just behave differently than we hope. We also make sure to use the same similarity when indexing and searching.

What are we doing wrong? Are we missing some code to 'apply' the configuration properly?

public IndexWriterConfig index(List<DocumentInCollection> docs) throws IOException {
    Analyzer analyz;
    IndexWriterConfig config;

    if (analyzer.equals("vsm") && stopwords && stemmer) {
        //VSM cosine similarity with TFIDF + stopwords + stemmer
        CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
        analyz = new EnglishAnalyzer(stopWords);
        config = new IndexWriterConfig(analyz);
        config.setSimilarity(new ClassicSimilarity());
    } else if (analyzer.equals("vsm") && !stopwords && stemmer) {
        //VSM cosine similarity with TFIDF - stopwords + stemmer
        analyz = new EnglishAnalyzer(CharArraySet.EMPTY_SET);
        config = new IndexWriterConfig(analyz);
        config.setSimilarity(new ClassicSimilarity());
    } else if (analyzer.equals("vsm") && stopwords && !stemmer) {
        //VSM cosine similarity with TFIDF - stopwords - stemmer
        CharArraySet stopWords = StandardAnalyzer.STOP_WORDS_SET;
        analyz = new StandardAnalyzer(stopWords);
        config = new IndexWriterConfig(analyz);
        config.setSimilarity(new ClassicSimilarity());
    } else if (analyzer.equals("bm25") && stopwords && stemmer) {
        //Analyzer + stopwords + stemmer
        CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
        analyz = new EnglishAnalyzer(stopWords);
        config = new IndexWriterConfig(analyz);
        //BM25 ranking method
        config.setSimilarity(new BM25Similarity());
    } else if (analyzer.equals("bm25") && !stopwords && stemmer) {
        //Analyzer - stopwords + stemmer
        analyz = new EnglishAnalyzer(CharArraySet.EMPTY_SET);
        config = new IndexWriterConfig(analyz);
        //BM25 ranking method
        config.setSimilarity(new BM25Similarity());
    } else if (analyzer.equals("bm25") && stopwords && !stemmer) {
        //Analyzer + stopwords - stemmer
        CharArraySet stopWords = StandardAnalyzer.STOP_WORDS_SET;
        analyz = new StandardAnalyzer(stopWords);
        config = new IndexWriterConfig(analyz);
        //BM25 ranking method
        config.setSimilarity(new BM25Similarity());
    } else {
        //some default
        analyz = new StandardAnalyzer();
        config = new IndexWriterConfig(analyz);
        config.setSimilarity(new ClassicSimilarity());
    }

    IndexWriter w = new IndexWriter(corpus, config);

    //total 153 documents with group 5
    for (DocumentInCollection doc1 : docs) {
        if (doc1.getSearchTaskNumber() == 5) {
            Document doc = new Document();
            doc.add(new TextField("title", doc1.getTitle(), Field.Store.YES));
            doc.add(new TextField("abstract_text", doc1.getAbstractText(), Field.Store.YES));
            doc.add(new TextField("relevance", Boolean.toString(doc1.isRelevant()), Field.Store.YES));
            w.addDocument(doc);
            totalDocs++;
            if (doc1.isRelevant()) relevantDocs++;
        }
    }
    w.close();

    return config;
}

public List<String> search(String searchQuery, IndexWriterConfig cf) throws IOException {
    printQuery(searchQuery);

    List<String> results = new LinkedList<String>();

    //Constructing QueryParser to stem search query
    QueryParser qp = new QueryParser("abstract_text", cf.getAnalyzer());
    Query stemmedQuery = null;
    try {
        stemmedQuery = qp.parse(searchQuery);
    } catch (ParseException e) {
        e.printStackTrace();
    }

    // opening directory for search
    IndexReader reader = DirectoryReader.open(corpus);
    // implementing search over IndexReader
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(cf.getSimilarity());
    // finding top totalDocs documents qualifying the search
    TopDocs docs = searcher.search(stemmedQuery, totalDocs);
    // representing array of hits from TopDocs
    ScoreDoc[] scored = docs.scoreDocs;

    // adding matched doc titles to results
    for (ScoreDoc aDoc : scored) {
        Document d = searcher.doc(aDoc.doc);
        retrieved++;
        //relevance and score are printed out for debug purposes
        if (d.get("relevance").equals("true")) {
            relevantRetrieved++;
            results.add("+ " + d.get("title") + " | relevant: " + d.get("relevance") + " | score: " + aDoc.score);
        } else {
            results.add("- " + d.get("title") + " | relevant: " + d.get("relevance") + " | score: " + aDoc.score);
        }
    }

    return results;
}

Accepted answer

Firstly, you wouldn't usually expect BM25 and Classic Similarities to return a different set of results, just different scores (and thus ordering). Generally, the similarity governs how scores are calculated for documents that have already been found to match the query. They will typically return the same results, but with different scores, and so in a different order.
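
To make that concrete, here is a minimal, self-contained sketch (not the poster's code and not the linked gist; the field name "abstract_text", the sample texts, and the use of RAMDirectory are made up for illustration, and it assumes a Lucene 7.x classpath). It indexes the same three tiny documents once per similarity and prints the hits; both runs should return the same documents, just with different scores.

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class SimilarityComparison {
    public static void main(String[] args) throws Exception {
        for (Similarity sim : new Similarity[] { new ClassicSimilarity(), new BM25Similarity() }) {
            // fresh in-memory index per similarity, so both runs see identical documents
            Directory dir = new RAMDirectory();
            IndexWriterConfig config = new IndexWriterConfig(new EnglishAnalyzer());
            config.setSimilarity(sim); // similarity used at index time
            try (IndexWriter w = new IndexWriter(dir, config)) {
                for (String text : new String[] { "the quick brown fox", "a lazy brown dog", "quick quick fox" }) {
                    Document doc = new Document();
                    doc.add(new TextField("abstract_text", text, Field.Store.YES));
                    w.addDocument(doc);
                }
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                searcher.setSimilarity(sim); // same similarity at search time
                Query q = new QueryParser("abstract_text", new EnglishAnalyzer()).parse("quick fox");
                TopDocs hits = searcher.search(q, 10);
                System.out.println(sim.getClass().getSimpleName());
                for (ScoreDoc sd : hits.scoreDocs) {
                    // the same documents match under both similarities; only the scores differ
                    System.out.println("  " + searcher.doc(sd.doc).get("abstract_text") + " -> " + sd.score);
                }
            }
        }
    }
}

If the two runs print identical scores, then the similarity is not actually being applied on one of the two sides (index time or search time).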

If you are seeing the same scores with the bm25 and vsm settings, then yes, something is going wrong. However, your code looks okay to me, based on my trimmed down, runnable test version: https://gist.github.com/anonymous/baf279806702edb54fab23db6d8d19b9

The stop word filter often isn't that big a change either. It governs whether stop words are indexed. Stop words are words like "the" and "this"; with the stop word filter they aren't indexed and can't be searched. Unless you are searching for a stop word, the difference generally won't be obvious. Again, this seems to be working correctly based on my test version.
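
For what the stop word setting does change, here's a similarly hedged sketch (again not the poster's code; the field name and sample sentence are made up) that just prints the tokens an EnglishAnalyzer emits with the default stop set versus an empty one, i.e. what would actually end up in the index.

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StopWordDemo {
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        // run the analysis chain and print each emitted token
        try (TokenStream ts = analyzer.tokenStream("abstract_text", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term.toString() + "] ");
            }
            ts.end();
            System.out.println();
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "this is the searching of the documents";
        // default stop set: common words like "this", "is", "the", "of" are dropped before indexing
        printTokens(new EnglishAnalyzer(EnglishAnalyzer.getDefaultStopSet()), text);
        // empty stop set: every word is kept (and still passed through the stemmer)
        printTokens(new EnglishAnalyzer(CharArraySet.EMPTY_SET), text);
    }
}

Unless the query itself contains one of those stop words, both variants match and rank the same documents, which is consistent with what you are seeing.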
