I am supposed to extract representative terms from an organisation's website using wikipedia's article-link data dump.

To achieve this I've:

- Crawled & downloaded the organisation's webpages (~110,000)
- Created a dictionary of wikipedia IDs and terms/titles (~40 million records)

Now I'm supposed to process each of the webpages using the dictionary to recognise terms and track their term IDs & frequencies. For the dictionary to fit in memory, I've split it into smaller files. Based on my experiment with a small data-set, the processing time for the above will be around 75 days. And this is just for 1 organisation; I have to do the same for more than 40 of them.

Implementation:

- A HashMap for storing the dictionary in memory.
- Looping through each map entry to search for the term in a webpage, using a Boyer-Moore search implementation.
- Repeating the above for each webpage, and storing the results in a HashMap.

I've tried optimizing the code and tuning the JVM for better performance. Can someone please advise on a more efficient way to implement the above, reducing the processing time to a few days? Is Hadoop an option to consider?
Based on your question:

- Number of documents = 110,000
- Dictionary => list of [TermID, Title Terms] = 40 million entries
- Size of documents = 110,000 * 256KB = 26.9GB (avg 256KB per document)
- Size of dictionary = 40 million * 256 bytes = 9.5GB of raw data (avg 256 bytes per entry)

How did you arrive at the 75 days estimate?
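As a sanity check on those totals: 26.9GB and 9.5GB are consistent with averages of roughly 256KB per crawled page and 256 bytes per dictionary entry. These averages are assumptions inferred from the stated totals, not measurements:

```java
public class SizeEstimates {
    public static void main(String[] args) {
        long documents = 110_000L;
        long dictionaryEntries = 40_000_000L;

        // Assumed averages: ~256KB per crawled page, ~256 bytes per dictionary entry
        long avgDocBytes = 256L * 1024;
        long avgEntryBytes = 256L;

        double corpusGiB = documents * avgDocBytes / (1024.0 * 1024 * 1024);
        double dictGiB = dictionaryEntries * avgEntryBytes / (1024.0 * 1024 * 1024);

        System.out.printf("corpus: %.1f GiB%n", corpusGiB);   // ~26.9 GiB
        System.out.printf("dictionary: %.1f GiB%n", dictGiB); // ~9.5 GiB
    }
}
```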
There are a number of performance targets:

- How are you storing the documents?
- How are you storing/retrieving the dictionary? (assuming not all of it is in memory, unless you can afford to)
- How many machines are you running it on?
- Are you performing the dictionary lookups in parallel? (of course, assuming the dictionary is immutable once you have already processed the whole of wikipedia)
Here is an outline of what I believe you are doing:

```scala
dictionary = read wikipedia dictionary
documents  = a sequence of documents

documents.map { doc =>
  var docTermFreq = Map[String, Int]()
  for (term <- doc.terms if dictionary.contains(term)) {
    docTermFreq = docTermFreq + (term -> (docTermFreq.getOrElse(term, 0) + 1))
  }
  // store docTermFreq map
}
```

What this is essentially doing is breaking each document up into tokens and then looking each token up in the wikipedia dictionary to check whether it exists.
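The outline above can be sketched in plain Java. The key point: with the dictionary in a hash set, each token costs one O(1) lookup, instead of scanning the page once per dictionary entry as the Boyer-Moore loop does. The `countTerms` helper and the `\W+` tokenizer are illustrative simplifications, not anyone's actual API:

```java
import java.util.*;

public class TermCounter {
    // Tokenize the page and look each token up in the dictionary.
    // One hash lookup per token, regardless of dictionary size.
    public static Map<String, Integer> countTerms(String page, Set<String> dictionary) {
        Map<String, Integer> docTermFreq = new HashMap<>();
        for (String token : page.toLowerCase().split("\\W+")) {
            if (dictionary.contains(token)) {
                docTermFreq.merge(token, 1, Integer::sum);
            }
        }
        return docTermFreq;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("wikipedia", "lucene"));
        Map<String, Integer> freq =
            countTerms("Lucene indexes Wikipedia; Wikipedia feeds Lucene.", dictionary);
        System.out.println(new TreeMap<>(freq)); // {lucene=2, wikipedia=2}
    }
}
```

Compare this with looping over 40 million dictionary entries and running a string search over the page for each one: that is O(dictionary × page) per document, which is where estimates like 75 days come from.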
This is exactly what a Lucene Analyzer does.

A Lucene Tokenizer will convert the document into tokens. This happens before the terms are indexed into lucene. So all you have to do is implement an Analyzer which can look up the Wikipedia dictionary to decide whether or not a token is in the dictionary.
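Since the exact Lucene Analyzer/TokenFilter API varies by version, here is the filtering idea sketched with a plain Java iterator instead. The `DictionaryFilter` name is illustrative; in real Lucene this logic would live in a custom token filter that drops tokens not found in the dictionary:

```java
import java.util.*;

// A token filter that passes through only tokens found in the dictionary,
// mirroring what a dictionary-aware Lucene analyzer would do.
public class DictionaryFilter implements Iterator<String> {
    private final Iterator<String> input;
    private final Set<String> dictionary;
    private String next;

    public DictionaryFilter(Iterator<String> input, Set<String> dictionary) {
        this.input = input;
        this.dictionary = dictionary;
        advance();
    }

    private void advance() {
        next = null;
        while (input.hasNext()) {
            String token = input.next();
            if (dictionary.contains(token)) { next = token; break; }
        }
    }

    public boolean hasNext() { return next != null; }

    public String next() {
        String token = next;
        advance();
        return token;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("wikipedia", "dump"));
        Iterator<String> tokens =
            Arrays.asList("the", "wikipedia", "article", "dump").iterator();
        DictionaryFilter filtered = new DictionaryFilter(tokens, dict);
        while (filtered.hasNext()) System.out.println(filtered.next()); // wikipedia, dump
    }
}
```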
I would do it like this:

- Take every document and prepare a token stream (using the Analyzer described above)
- Index the document terms.
- At this point you will have only wikipedia terms in the Lucene index.
When you do this, you will have ready-made statistics from the Lucene index, such as:

- Document Frequency of a term
- TermFrequencyVector (exactly what you need)
- and a ready-to-use inverted index! (for a quick introduction to inverted indexes and retrieval)
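To make document frequency and the inverted index concrete, here is a toy sketch (a hypothetical `InvertedIndex` class, not Lucene's API) of a posting list per term; document frequency falls out as the size of a term's posting set:

```java
import java.util.*;

public class InvertedIndex {
    // term -> set of document ids containing it (a simplified posting list)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, Collection<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Document frequency: the number of documents a term appears in
    public int docFreq(String term) {
        return postings.getOrDefault(term, Collections.emptySet()).size();
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.addDocument(1, Arrays.asList("wikipedia", "lucene"));
        index.addDocument(2, Arrays.asList("wikipedia"));
        System.out.println(index.docFreq("wikipedia")); // 2
        System.out.println(index.docFreq("lucene"));    // 1
    }
}
```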
There are a lot of things you can do to improve the performance. For example:

- Parallelize the document stream processing.
- Store the dictionary in a key-value database such as BerkeleyDB or Kyoto Cabinet, or even in an in-memory key-value store such as Redis or Memcached.

I hope that helps.
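Parallelizing the document stream is straightforward because each document's term counting is independent, and the dictionary is immutable at this stage so it can be shared across threads safely. A minimal sketch with Java parallel streams, where `countTerms` stands in for whatever per-document processing you use:

```java
import java.util.*;
import java.util.stream.*;

public class ParallelProcessing {
    public static Map<String, Integer> countTerms(String page, Set<String> dictionary) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : page.toLowerCase().split("\\W+")) {
            if (dictionary.contains(token)) freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("wikipedia", "lucene"));
        List<String> pages = Arrays.asList(
            "Wikipedia dump", "Lucene index", "Wikipedia and Lucene");

        // Each page is processed on its own; the shared dictionary is read-only.
        List<Map<String, Integer>> results = pages.parallelStream()
            .map(page -> countTerms(page, dictionary))
            .collect(Collectors.toList());

        System.out.println(results.size()); // 3
    }
}
```

The same shape maps directly onto a Hadoop/MapReduce job (map = per-document counting, reduce = aggregation) if a single machine is still too slow.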