Most efficient way to index words in a document?

Updated: 2024-10-11 15:13:37

Problem description

This came up in another question, but I figured it is best to ask it separately. Given a large list of sentences (on the order of hundreds of thousands):

    [
        "This is sentence 1 as an example",
        "This is sentence 1 as another example",
        "This is sentence 2",
        "This is sentence 3 as another example ",
        "This is sentence 4"
    ]

What is the best way to code the following function?

    def GetSentences(word1, word2, position):
        return ""

Given two words, word1 and word2, and a position, the function should return the list of all sentences in which word2 appears position words after word1. For example:

    GetSentences("sentence", "another", 3)

should return 1 and 3, the indices of the matching sentences. My current approach uses a nested dictionary like this:

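    from collections import defaultdict

    # Index[word][word2][distance] -> list of sentence indices
    Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sentenceIndex, sentence in enumerate(sentences):
        words = sentence.split()
        for index, word in enumerate(words):
            # i + 1 is the distance from word to word2
            for i, word2 in enumerate(words[index + 1:]):
                Index[word][word2][i + 1].append(sentenceIndex)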

But this quickly gets out of hand: on a dataset of about 130 MB, it exhausts my 48 GB of RAM in less than 5 minutes. I have a feeling this is a common problem, but I can't find any references on how to solve it efficiently. Any suggestions on how to approach this?

Recommended answer

Use a database to store the values.

  • First, add all the sentences to one table (each should have an ID). You could call it e.g. sentences.
  • Second, create a table of the words contained in all the sentences (call it e.g. words, and give each word an ID). Store the links between sentence records and word records in a separate table (call it e.g. sentences_words; it should have two columns, preferably word_id and sentence_id).
  • When searching for sentences containing all the given words, your job is then simplified:

  • First, find the records in the words table whose words are exactly the ones you are searching for. The query could look like this:

    SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3');

  • Second, find the sentence_id values in the sentences_words table that have the required word_id values (corresponding to the words found in the words table). The initial query could look like this:

    SELECT `sentence_id`, `word_id` FROM `sentences_words` WHERE `word_id` IN ([here goes list of words' ids]);

    which can be simplified to:

    SELECT `sentence_id`, `word_id`
    FROM `sentences_words`
    WHERE `word_id` IN (
        SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3')
    );

  • Finally, filter the result in Python to return only the sentence_id values that have all the required word_id values.

  • This is basically a solution based on storing a large amount of data in the form best suited to it: a database.
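The steps above can be sketched with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions, not the answerer's exact setup: the choice of SQLite, the in-memory database, the column types, the extra index, and the helper names build_index and sentences_with_all_words are all hypothetical choices made for illustration.

```python
import sqlite3
from collections import defaultdict


def build_index(sentences):
    """Load sentences into the three tables described above (SQLite is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT);
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE sentences_words (word_id INTEGER, sentence_id INTEGER);
        CREATE INDEX idx_sw_word ON sentences_words (word_id);
    """)
    for sentence_id, text in enumerate(sentences):
        conn.execute("INSERT INTO sentences (id, text) VALUES (?, ?)",
                     (sentence_id, text))
        # One link row per distinct word in the sentence.
        for word in set(text.split()):
            conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
            (word_id,) = conn.execute(
                "SELECT id FROM words WHERE word = ?", (word,)).fetchone()
            conn.execute(
                "INSERT INTO sentences_words (word_id, sentence_id) VALUES (?, ?)",
                (word_id, sentence_id))
    conn.commit()
    return conn


def sentences_with_all_words(conn, search_words):
    """Run the simplified IN query, then keep sentences that matched every word."""
    wanted = set(search_words)
    placeholders = ", ".join("?" for _ in wanted)
    rows = conn.execute(
        "SELECT sentence_id, word_id FROM sentences_words "
        "WHERE word_id IN (SELECT id FROM words WHERE word IN (%s))" % placeholders,
        tuple(wanted))
    found = defaultdict(set)
    for sentence_id, word_id in rows:
        found[sentence_id].add(word_id)
    # A sentence qualifies only if it matched as many distinct words as were requested.
    return sorted(sid for sid, ids in found.items() if len(ids) == len(wanted))


sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]
conn = build_index(sentences)
print(sentences_with_all_words(conn, ["sentence", "another"]))
```

For a dataset of the size mentioned in the question, an on-disk database file and batched inserts would likely be more appropriate than the in-memory, row-by-row loading used here.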

  • If you only ever search for two words, you can do even more (almost everything) on the DBMS side.
  • Since you also need the position difference, store each word's position in a third column of the sentences_words table (call it simply position). When searching, compute the difference between the position values associated with the two words.
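With a position column in sentences_words, the two-word lookup from the question can be pushed almost entirely into the DBMS using a self-join. The sketch below is one possible reading, not a definitive implementation: SQLite via the sqlite3 module, the composite index, the helper name build_positional_index, and the snake_case name get_sentences are assumptions, and the sentences table is omitted to keep the example short.

```python
import sqlite3


def build_positional_index(sentences):
    """Store one row per word occurrence, including its position in the sentence."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE sentences_words (
            word_id INTEGER, sentence_id INTEGER, position INTEGER
        );
        CREATE INDEX idx_sw ON sentences_words (word_id, sentence_id, position);
    """)
    word_ids = {}  # word -> id, assigned in order of first appearance
    rows = []
    for sentence_id, text in enumerate(sentences):
        for position, word in enumerate(text.split()):
            word_id = word_ids.setdefault(word, len(word_ids) + 1)
            rows.append((word_id, sentence_id, position))
    conn.executemany("INSERT INTO words (id, word) VALUES (?, ?)",
                     [(wid, w) for w, wid in word_ids.items()])
    conn.executemany("INSERT INTO sentences_words VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def get_sentences(conn, word1, word2, position):
    """Return ids of sentences where word2 occurs `position` words after word1."""
    query = """
        SELECT DISTINCT a.sentence_id
        FROM sentences_words AS a
        JOIN sentences_words AS b ON b.sentence_id = a.sentence_id
        JOIN words AS w1 ON w1.id = a.word_id
        JOIN words AS w2 ON w2.id = b.word_id
        WHERE w1.word = ? AND w2.word = ?
          AND b.position - a.position = ?
        ORDER BY a.sentence_id
    """
    return [row[0] for row in conn.execute(query, (word1, word2, position))]


sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]
conn = build_positional_index(sentences)
print(get_sentences(conn, "sentence", "another", 3))
```

Unlike the nested dictionary in the question, which stores an entry for every ordered pair of words in a sentence, this layout stores one row per word occurrence, so storage grows linearly with the total number of words.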
Published: 2023-10-11 17:28:16
Link: https://www.elefans.com/category/jswz/34/1482350.html