I am supposed to extract representative terms from an organisation's website using wikipedia's article-link data dump.

To achieve this I've:

- Crawled & downloaded the organisation's webpages (~110,000)
- Created a dictionary of wikipedia IDs and terms/titles (~40 million records)

Now I'm supposed to process each of the webpages using the dictionary to recognise terms and track their term IDs & frequencies. For the dictionary to fit in memory, I've split it into smaller files. Based on my experiment with a small data-set, the processing time for the above will be around 75 days. And this is just for 1 organisation; I have to do the same for more than 40 of them.

Implementation:

- A HashMap for storing the dictionary in memory.
- Looping through each map entry to search for the term in a webpage, using a Boyer-Moore search implementation.
- Repeating the above for each webpage, and storing the results in a HashMap.

I've tried optimizing the code and tuning the JVM for better performance. Can someone please advise on a more efficient way to implement the above, reducing the processing time to a few days? Is Hadoop an option to consider?
Based on your question:

- Number of documents = 110,000
- Dictionary => list of [TermID, Title Terms] = 40 million entries
- Size of documents = 110,000 * 256KB = 26.9GB (avg 256KB per document)
- Size of dictionary = 40 million * 256 bytes = 9.5GB of raw data (avg 256 bytes per entry)

How did you arrive at the 75 days estimate?
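As a sanity check on those totals: 26.9GB and 9.5GB are consistent with averages of roughly 256KB per crawled page and 256 bytes per dictionary entry. These averages are assumptions inferred from the stated totals, not measurements:

```java
public class SizeEstimates {
    public static void main(String[] args) {
        long documents = 110_000L;
        long dictionaryEntries = 40_000_000L;

        // Assumed averages: ~256KB per crawled page, ~256 bytes per dictionary entry
        long avgDocBytes = 256L * 1024;
        long avgEntryBytes = 256L;

        double corpusGiB = documents * avgDocBytes / (1024.0 * 1024 * 1024);
        double dictGiB = dictionaryEntries * avgEntryBytes / (1024.0 * 1024 * 1024);

        System.out.printf("corpus: %.1f GiB%n", corpusGiB);   // ~26.9 GiB
        System.out.printf("dictionary: %.1f GiB%n", dictGiB); // ~9.5 GiB
    }
}
```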
There are a number of performance targets:

- How are you storing the documents?
- How are you storing/retrieving the dictionary? (assuming not all of it is in memory, unless you can afford to)
- How many machines are you running it on?
- Are you performing the dictionary lookups in parallel? (of course, assuming the dictionary is immutable once you have already processed the whole of wikipedia)
Here is an outline of what I believe you are doing:

```scala
dictionary = read wikipedia dictionary
documents  = a sequence of documents

documents.map { doc =>
  var docTermFreq = Map[String, Int]()
  for (term <- doc.terms if dictionary.contains(term)) {
    docTermFreq = docTermFreq + (term -> (docTermFreq.getOrElse(term, 0) + 1))
  }
  // store docTermFreq map
}
```

What this is essentially doing is breaking each document up into tokens and then looking each token up in the wikipedia dictionary to check whether it exists.
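The outline above can be sketched in plain Java. The key point: with the dictionary in a hash set, each token costs one O(1) lookup, instead of scanning the page once per dictionary entry as the Boyer-Moore loop does. The `countTerms` helper and the `\W+` tokenizer are illustrative simplifications, not anyone's actual API:

```java
import java.util.*;

public class TermCounter {
    // Tokenize the page and look each token up in the dictionary.
    // One hash lookup per token, regardless of dictionary size.
    public static Map<String, Integer> countTerms(String page, Set<String> dictionary) {
        Map<String, Integer> docTermFreq = new HashMap<>();
        for (String token : page.toLowerCase().split("\\W+")) {
            if (dictionary.contains(token)) {
                docTermFreq.merge(token, 1, Integer::sum);
            }
        }
        return docTermFreq;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("wikipedia", "lucene"));
        Map<String, Integer> freq =
            countTerms("Lucene indexes Wikipedia; Wikipedia feeds Lucene.", dictionary);
        System.out.println(new TreeMap<>(freq)); // {lucene=2, wikipedia=2}
    }
}
```

Compare this with looping over 40 million dictionary entries and running a string search over the page for each one: that is O(dictionary × page) per document, which is where estimates like 75 days come from.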
This is exactly what a Lucene Analyzer does.

A Lucene Tokenizer will convert the document into tokens. This happens before the terms are indexed into lucene. So all you have to do is implement an Analyzer which can look up the Wikipedia dictionary to decide whether or not a token is in the dictionary.
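Since the exact Lucene Analyzer/TokenFilter API varies by version, here is the filtering idea sketched with a plain Java iterator instead. The `DictionaryFilter` name is illustrative; in real Lucene this logic would live in a custom token filter that drops tokens not found in the dictionary:

```java
import java.util.*;

// A token filter that passes through only tokens found in the dictionary,
// mirroring what a dictionary-aware Lucene analyzer would do.
public class DictionaryFilter implements Iterator<String> {
    private final Iterator<String> input;
    private final Set<String> dictionary;
    private String next;

    public DictionaryFilter(Iterator<String> input, Set<String> dictionary) {
        this.input = input;
        this.dictionary = dictionary;
        advance();
    }

    private void advance() {
        next = null;
        while (input.hasNext()) {
            String token = input.next();
            if (dictionary.contains(token)) { next = token; break; }
        }
    }

    public boolean hasNext() { return next != null; }

    public String next() {
        String token = next;
        advance();
        return token;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("wikipedia", "dump"));
        Iterator<String> tokens =
            Arrays.asList("the", "wikipedia", "article", "dump").iterator();
        DictionaryFilter filtered = new DictionaryFilter(tokens, dict);
        while (filtered.hasNext()) System.out.println(filtered.next()); // wikipedia, dump
    }
}
```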
I would do it like this:

- Take every document and prepare a token stream (using the Analyzer described above)
- Index the document terms.
- At this point you will have only wikipedia terms in the Lucene index.
When you do this, you will have ready-made statistics from the Lucene index, such as:

- Document Frequency of a term
- TermFrequencyVector (exactly what you need)
- and a ready-to-use inverted index! (for a quick introduction to inverted indexes and retrieval)
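To make document frequency and the inverted index concrete, here is a toy sketch (a hypothetical `InvertedIndex` class, not Lucene's API) of a posting list per term; document frequency falls out as the size of a term's posting set:

```java
import java.util.*;

public class InvertedIndex {
    // term -> set of document ids containing it (a simplified posting list)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, Collection<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Document frequency: the number of documents a term appears in
    public int docFreq(String term) {
        return postings.getOrDefault(term, Collections.emptySet()).size();
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.addDocument(1, Arrays.asList("wikipedia", "lucene"));
        index.addDocument(2, Arrays.asList("wikipedia"));
        System.out.println(index.docFreq("wikipedia")); // 2
        System.out.println(index.docFreq("lucene"));    // 1
    }
}
```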
There are a lot of things you can do to improve the performance. For example:

- Parallelize the document stream processing.
- Store the dictionary in a key-value database such as BerkeleyDB or Kyoto Cabinet, or even in an in-memory key-value store such as Redis or Memcached.

I hope that helps.
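Parallelizing the document stream is straightforward because each document's term counting is independent, and the dictionary is immutable at this stage so it can be shared across threads safely. A minimal sketch with Java parallel streams, where `countTerms` stands in for whatever per-document processing you use:

```java
import java.util.*;
import java.util.stream.*;

public class ParallelProcessing {
    public static Map<String, Integer> countTerms(String page, Set<String> dictionary) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : page.toLowerCase().split("\\W+")) {
            if (dictionary.contains(token)) freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("wikipedia", "lucene"));
        List<String> pages = Arrays.asList(
            "Wikipedia dump", "Lucene index", "Wikipedia and Lucene");

        // Each page is processed on its own; the shared dictionary is read-only.
        List<Map<String, Integer>> results = pages.parallelStream()
            .map(page -> countTerms(page, dictionary))
            .collect(Collectors.toList());

        System.out.println(results.size()); // 3
    }
}
```

The same shape maps directly onto a Hadoop/MapReduce job (map = per-document counting, reduce = aggregation) if a single machine is still too slow.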