This came up in another question, but I figured it is best to ask it as a separate question. Given a large list of sentences (on the order of 100,000):
    [
        "This is sentence 1 as an example",
        "This is sentence 1 as another example",
        "This is sentence 2",
        "This is sentence 3 as another example ",
        "This is sentence 4"
    ]

what is the best way to code the following function?
    def GetSentences(word1, word2, position):
        return ""

where, given two words word1 and word2 and a position, the function should return the list of all sentences in which word2 appears position words after word1. For example:
    GetSentences("sentence", "another", 3)

should return sentences 1 and 3, given as indices into the list. My current approach uses a nested dictionary like this:
    from collections import defaultdict

    # Index[word][word2][distance] -> list of sentence indices
    Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sentenceIndex, sentence in enumerate(sentences):
        words = sentence.split()
        for index, word in enumerate(words):
            # i + 1 is the distance in words from word to word2
            for i, word2 in enumerate(words[index + 1:]):
                Index[word][word2][i + 1].append(sentenceIndex)

But this quickly blows out of proportion: on a dataset of about 130 MB, my 48 GB of RAM is exhausted in less than 5 minutes. I have a feeling this is a common problem, but I can't find any references on how to solve it efficiently. Any suggestions on how to approach this?
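For reference, the constraint from the example can be checked without any index by scanning every sentence. The following is only a minimal baseline sketch, assuming whitespace tokenization and that position means the number of words from word1 to word2; the name get_sentences_scan is hypothetical. It is slow for repeated queries but needs no extra memory, which makes it useful for validating a faster index.

```python
def get_sentences_scan(sentences, word1, word2, position):
    """Return indices of sentences where word2 occurs `position` words after word1."""
    matches = []
    for sentence_index, sentence in enumerate(sentences):
        words = sentence.split()
        # Check every occurrence of word1 and look `position` words ahead.
        for i, word in enumerate(words):
            j = i + position
            if word == word1 and 0 <= j < len(words) and words[j] == word2:
                matches.append(sentence_index)
                break  # record each sentence at most once
    return matches


sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]
print(get_sentences_scan(sentences, "sentence", "another", 3))
```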
Recommended answer

Use a database to store the values.
This simplifies the job of searching for sentences that contain all the given words:
First, find the records in the words table whose word is exactly one of those you are searching for. The query could look like this:

    SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3');
Second, find the sentence_id values in the sentences_words table that have the required word_id values (corresponding to the words from the words table). The initial query could look like this:
    SELECT `sentence_id`, `word_id` FROM `sentences_words`
    WHERE `word_id` IN ([here goes list of words' ids]);

which can be simplified to:

    SELECT `sentence_id`, `word_id` FROM `sentences_words`
    WHERE `word_id` IN (
        SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3')
    );
Finally, filter the result in Python so that only the sentence_id values that have all the required word_id values are returned.
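The steps above can be sketched end to end as follows. This is only an illustration under assumptions: the answer does not name a database engine, so the sketch uses Python's built-in sqlite3 module with an in-memory database; the column types, the index, and the helper name sentences_with_all are hypothetical, while the table and column names follow the queries above.

```python
import sqlite3
from collections import defaultdict

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]

# Assumed schema: one row per distinct word, one row per (sentence, word) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE sentences_words (sentence_id INTEGER, word_id INTEGER);
    CREATE INDEX idx_sentences_words_word_id ON sentences_words (word_id);
""")

for sentence_id, sentence in enumerate(sentences):
    for word in set(sentence.split()):
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        (word_id,) = conn.execute(
            "SELECT id FROM words WHERE word = ?", (word,)
        ).fetchone()
        conn.execute(
            "INSERT INTO sentences_words VALUES (?, ?)", (sentence_id, word_id)
        )


def sentences_with_all(conn, wanted):
    """Return ids of sentences that contain every word in `wanted`."""
    wanted = list(set(wanted))
    placeholders = ", ".join("?" for _ in wanted)
    # Same shape as the simplified query: a subquery resolves words to ids.
    rows = conn.execute(
        "SELECT sentence_id, word_id FROM sentences_words "
        "WHERE word_id IN (SELECT id FROM words WHERE word IN (%s))" % placeholders,
        wanted,
    )
    # Filter in Python: keep sentences that matched all requested words.
    found = defaultdict(set)
    for sentence_id, word_id in rows:
        found[sentence_id].add(word_id)
    return sorted(s for s, ids in found.items() if len(ids) == len(wanted))


print(sentences_with_all(conn, ["sentence", "another"]))
```

Only the placeholder list is interpolated into the SQL string; the words themselves are passed as bound parameters.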
This is basically a solution based on storing a large amount of data in the form best suited to it: a database.
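The queries in this answer match on word presence only and do not cover the position argument from the question. One possible extension, sketched here purely as an assumption, is to add a hypothetical position column to sentences_words (the word's offset within its sentence) and join the table with itself. The sketch again assumes sqlite3 and whitespace tokenization; get_sentences and the index name are illustrative.

```python
import sqlite3

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    -- position is a hypothetical extra column, not part of the answer's schema.
    CREATE TABLE sentences_words (
        sentence_id INTEGER, word_id INTEGER, position INTEGER
    );
    CREATE INDEX idx_sw ON sentences_words (word_id, sentence_id, position);
""")

# One row per word occurrence, so storage grows linearly with the corpus.
for sentence_id, sentence in enumerate(sentences):
    for position, word in enumerate(sentence.split()):
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        (word_id,) = conn.execute(
            "SELECT id FROM words WHERE word = ?", (word,)
        ).fetchone()
        conn.execute(
            "INSERT INTO sentences_words VALUES (?, ?, ?)",
            (sentence_id, word_id, position),
        )


def get_sentences(conn, word1, word2, position):
    """Return ids of sentences where word2 occurs `position` words after word1."""
    rows = conn.execute(
        """
        SELECT DISTINCT a.sentence_id
        FROM sentences_words AS a
        JOIN sentences_words AS b
          ON b.sentence_id = a.sentence_id AND b.position = a.position + ?
        WHERE a.word_id = (SELECT id FROM words WHERE word = ?)
          AND b.word_id = (SELECT id FROM words WHERE word = ?)
        ORDER BY a.sentence_id
        """,
        (position, word1, word2),
    )
    return [sentence_id for (sentence_id,) in rows]


print(get_sentences(conn, "sentence", "another", 3))
```

Unlike the nested dictionary in the question, which stores an entry for every pair of words in a sentence, this layout stores each occurrence once and computes the pairing at query time.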