Python: tokenizing 2-word phrases for a word2vec model
I'm using the Python gensim package for word2vec.
I want to run the model on both single-word tokens and 2-word phrases. I have ~10,000 documents, and I used nltk's RegexpTokenizer to get the single-word tokens from all of them. How can I tokenize the documents to also get the 2-word phrases?
For example:
document: "I have a green apple"
and the 2-word phrases: {I_have}, {green_apple}, etc.
Accepted answer
One option is to use ngrams from nltk and set n=2 to get a list of tuples:

from nltk import ngrams

n = 2
bigram_list = list(ngrams(document.split(), n))
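ngrams returns tuples like ("green", "apple"); to get the {green_apple} form the question asks for, each tuple still needs to be joined with an underscore. As a minimal sketch (the helper name bigram_phrases is mine, and plain zip is used here instead of nltk.ngrams so the snippet is self-contained, but the output matches what joining the ngrams tuples would give):

```python
def bigram_phrases(tokens):
    # Pair each token with its successor and join with "_",
    # e.g. ("green", "apple") -> "green_apple".
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]

document = "I have a green apple"
tokens = document.split()
print(bigram_phrases(tokens))
# ['I_have', 'have_a', 'a_green', 'green_apple']
```

The resulting list (alone, or concatenated with the single-word tokens) can then be passed as a sentence to gensim's Word2Vec. Note that gensim also ships a Phrases model (gensim.models.phrases.Phrases) that learns which bigrams to merge based on co-occurrence counts, which may be preferable to keeping every adjacent pair.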
发布评论