Python: tokenizing 2-word phrases for a word2vec model
I'm using the Python gensim package for word2vec.
I want to run the model on both single-word tokens and 2-word phrases. I have ~10,000 documents, and I used nltk's RegexpTokenizer to get the single-word tokens from all of them. How can I tokenize the documents to also get the 2-word phrases?
For example:
document: "I have a green apple"
and the 2-word phrases: {I_have}, {green_apple}, etc.
Accepted answer
One option is to use ngrams from nltk and set n=2 to get a list of tuples:

from nltk import ngrams

n = 2
bigram_list = list(ngrams(document.split(), n))
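ngrams returns tuples like ("green", "apple"); to get the {green_apple} form the question asks for, each tuple still needs to be joined with an underscore. As a minimal sketch (the helper name bigram_phrases is mine, and plain zip is used here instead of nltk.ngrams so the snippet is self-contained, but the output matches what joining the ngrams tuples would give):

```python
def bigram_phrases(tokens):
    # Pair each token with its successor and join with "_",
    # e.g. ("green", "apple") -> "green_apple".
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]

document = "I have a green apple"
tokens = document.split()
print(bigram_phrases(tokens))
# ['I_have', 'have_a', 'a_green', 'green_apple']
```

The resulting list (alone, or concatenated with the single-word tokens) can then be passed as a sentence to gensim's Word2Vec. Note that gensim also ships a Phrases model (gensim.models.phrases.Phrases) that learns which bigrams to merge based on co-occurrence counts, which may be preferable to keeping every adjacent pair.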
发布评论