Indexing a txt file in Lucene

I want to create a small search engine for tweets. I have a txt file with 20000 tweets. The file format looks like this:

TommyFrench1 851 85170333395811123 Lurgan, Moira, Armagh. Derry This week we are double delight on first goalscorers on the four Champions League matches in shop. ChampionsLeague

Im_Aarkay 175 851703414300037122 Paris @ChampionsLeague @AS_Monaco @AS_Monaco_EN Nopes, it's when City knocked outta Champions league. . . etc

The first line is the username, the second is the number of followers, then come the id and the location, and the last is the text (the tweet).

I think every tweet should be a document, so I need 20000 documents, each with 5 fields (username, followers, id, etc.).

How can I do the indexing?

I have seen some tutorials, but I didn't find anything similar.

EDIT: Here is my code.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.nio.file.Paths;
    import java.text.ParseException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopScoreDocCollector;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class MyProgram {

        public static void main(String[] args) throws IOException, ParseException {
            FileReader fileReader = new FileReader(new File("myfile.txt"));
            BufferedReader br = new BufferedReader(fileReader);
            String line = null;

            String indexPath = "C:\\Desktop\\myfolder";
            Directory dir = FSDirectory.open(Paths.get(indexPath));
            Analyzer analyzer = new StandardAnalyzer();
            IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
            IndexWriter writer = new IndexWriter(dir, iwc);

            while ((line = br.readLine()) != null) {
                // reading lines until the end of the file
                Document doc = new Document();

                String username = br.readLine();
                doc.add(new Field("username", username, Field.Store.YES, Field.Index.ANALYZED)); // adding title field

                String followers = br.readLine();
                doc.add(new Field("followers", followers, Field.Store.YES, Field.Index.ANALYZED));

                String id = br.readLine();
                doc.add(new Field("id", id, Field.Store.YES, Field.Index.ANALYZED));

                String location = br.readLine();
                doc.add(new Field("location", location, Field.Store.YES, Field.Index.ANALYZED));

                String text = br.readLine();
                doc.add(new Field("text", text, Field.Store.YES, Field.Index.ANALYZED));

                writer.addDocument(doc); // writing new document to the index
                br.readLine();
            }
        }
    }

I'm getting the following error: Index cannot be resolved or is not a field.

How can I fix this?

Best answer

From your question it is hard to tell that you are facing a compile-time error rather than a runtime error.

I had to copy your code to see that the compile-time error is on the Field.Index.ANALYZED argument to the Field constructor.

Refer to the Field documentation: 6.5.0 no longer has constructors that take a Field.Index argument.

This is one of the reasons people use higher-level tools such as Solr: changes like this keep happening in the low-level Lucene API.

Anyway, the same documentation also says this about constructing a Field directly:

Expert: directly create a field for a document. Most users should use one of the sugar subclasses:

For your case, TextField and StringField are the relevant classes. There is a subtle difference between the two: a TextField is tokenized by the analyzer, while a StringField is indexed verbatim as a single token.

So I would use a constructor such as new StringField(fieldName, fieldValue, Store.YES) instead of constructing a Field directly.

You can also use Field itself, as in new Field(fieldName, fieldValue, fieldType), where fieldType is a FieldType.

You can initialize a FieldType like this: FieldType txtFieldType = new FieldType(TextField.TYPE_STORED) or FieldType strFieldType = new FieldType(StringField.TYPE_STORED).
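The options above can be put together in a short sketch. It assumes the Lucene 6.x jars are on the classpath; the class name TweetFields, the helper toDocument, the in-memory RAMDirectory, and the choice of which fields are StringField versus TextField are illustrative assumptions, not something the question prescribes. The sample values are taken from the second tweet in the question.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class TweetFields {

    // Hypothetical helper: builds one Lucene document from the five tweet fields.
    static Document toDocument(String username, String followers, String id,
                               String location, String text) {
        Document doc = new Document();
        // StringField: indexed as a single token, so it matches only on the exact value.
        doc.add(new StringField("username", username, Field.Store.YES));
        doc.add(new StringField("followers", followers, Field.Store.YES));
        doc.add(new StringField("id", id, Field.Store.YES));
        // Field with an explicit FieldType copied from TextField.TYPE_STORED.
        doc.add(new Field("location", location, new FieldType(TextField.TYPE_STORED)));
        // TextField: tokenized by the analyzer, suited to full-text search.
        doc.add(new TextField("text", text, Field.Store.YES));
        return doc;
    }

    public static void main(String[] args) throws IOException {
        // In-memory directory for the sketch only; the question uses FSDirectory.open(path).
        // try-with-resources closes the writer (committing pending changes) and then the directory.
        try (Directory dir = new RAMDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = toDocument("Im_Aarkay", "175", "851703414300037122", "Paris",
                    "@ChampionsLeague @AS_Monaco @AS_Monaco_EN Nopes, it's when City knocked outta Champions league");
            writer.addDocument(doc);
            System.out.println("Indexed tweet from " + doc.get("username"));
        }
    }
}
```

Note that the question's code never closes the IndexWriter; closing it, as the try-with-resources block does here, is what flushes the added documents to the index.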

All in all, the way you create a Field in Lucene has changed in recent versions, so create your Field instances according to the documentation for the Lucene version you are using.

For example: doc.add(new Field("username", username, new FieldType(TextField.TYPE_STORED))).
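Separately from the compile error, the loop in the question calls br.readLine() once in the while condition and again for username, so the first line of every record is read and thrown away. Below is a minimal sketch of a read loop that avoids this, using only the JDK. It assumes one field per line with a blank line between tweets, which is what the question's code suggests but the post does not confirm; the class TweetReader, the Tweet holder, and the second sample record are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TweetReader {

    // Hypothetical holder for the five fields of one tweet.
    static final class Tweet {
        final String username;
        final String followers;
        final String id;
        final String location;
        final String text;

        Tweet(String username, String followers, String id, String location, String text) {
            this.username = username;
            this.followers = followers;
            this.id = id;
            this.location = location;
            this.text = text;
        }
    }

    // Reads records of five consecutive lines, skipping blank separator lines.
    static List<Tweet> readTweets(BufferedReader br) throws IOException {
        List<Tweet> tweets = new ArrayList<>();
        String username;
        // The line read in the loop condition IS the username; it is not discarded.
        while ((username = br.readLine()) != null) {
            if (username.trim().isEmpty()) {
                continue; // blank line between tweets
            }
            String followers = br.readLine();
            String id = br.readLine();
            String location = br.readLine();
            String text = br.readLine();
            if (text == null) {
                break; // file ended in the middle of a record
            }
            tweets.add(new Tweet(username, followers, id, location, text));
        }
        return tweets;
    }

    public static void main(String[] args) throws IOException {
        // Illustrative input; the second record is made up for the example.
        String sample = "Im_Aarkay\n175\n851703414300037122\nParis\nNopes, it's when City knocked outta Champions league\n"
                + "\n"
                + "example_user\n10\n1\nSomewhere\nAn example tweet\n";
        try (BufferedReader br = new BufferedReader(new StringReader(sample))) {
            for (Tweet t : readTweets(br)) {
                System.out.println(t.username + ": " + t.text);
            }
        }
    }
}
```

Each Tweet can then be turned into a Document with StringField and TextField instances and passed to writer.addDocument.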
