ccmt2019

Updated: 2024-10-14 06:17:31

Next, encode the corpus with BPE (byte pair encoding), which mitigates part of the out-of-vocabulary (OOV) word problem.
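To make the idea concrete, here is a toy sketch of how BPE learns merges: it repeatedly fuses the most frequent adjacent symbol pair, so a rare word can later be written as a sequence of frequent subword units. This is only an illustration and not subword-nmt's actual implementation; the names `count_pairs`, `merge_pair`, and `toy_vocab` are made up for this example, and the word counts are invented.

```python
import collections
import re


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Fuse every occurrence of the given symbol pair into a single symbol."""
    # Match the pair only on whole-symbol boundaries (whitespace or string edge).
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


# Hypothetical toy data: words pre-split into characters, "</w>" marks the word end.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

for _ in range(3):  # learn three merge operations
    pairs = count_pairs(toy_vocab)
    best = max(pairs, key=pairs.get)  # most frequent pair (first one wins ties)
    toy_vocab = merge_pair(best, toy_vocab)

print(toy_vocab)
```

With a real corpus the number of merges is in the thousands, and the learned merge list is what subword-nmt stores in its codes file.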

First, merge all the English corpora into a single file, en.txt, and merge the corresponding Chinese parallel corpora into cn.txt.

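After merging, check that the two files are still aligned sentence by sentence: line i of en.txt must be the translation of line i of cn.txt.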
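A minimal sketch of such a check, assuming one sentence per line in both files. It can only catch structural problems (different line counts, or a line that is empty on one side only); it cannot prove that the translations actually match. The function names are hypothetical.

```python
def find_misaligned(src_lines, tgt_lines):
    """Return (line_counts_match, line numbers where exactly one side is empty)."""
    same_length = len(src_lines) == len(tgt_lines)
    suspicious = []
    for number, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        # An empty sentence paired with a non-empty one usually means a shift.
        if bool(src.strip()) != bool(tgt.strip()):
            suspicious.append(number)
    return same_length, suspicious


def check_files(src_path, tgt_path):
    """Run the check on two parallel corpus files."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        return find_misaligned(src.readlines(), tgt.readlines())
```

For example, `check_files(".//corpus//en.txt", ".//corpus//cn.txt")` should return `(True, [])` before you move on to BPE.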

Then run the subword-nmt tool to do the BPE encoding.

__author__ = 'jmh081701'
import os

directory = ".//corpus//"
input_file = directory + "en.txt"
code_file = directory + "en_code.txt"
dict_file = directory + "en_vocb.txt"

# Step 1: train the BPE model and write the vocabulary (uncomment to run)
#os.system("subword-nmt learn-joint-bpe-and-vocab -i %s -o %s --write-vocabulary %s" % (input_file, code_file, dict_file))

# Step 2: apply the BPE model to the corpus
# en
output_file = directory + "en" + ".bpe.txt"
os.system("subword-nmt apply-bpe -i %s -c %s -o %s" % (input_file, code_file, output_file))

# cn (uncomment and repeat both steps with these paths)
#output_file = directory + "cn" + ".bpe.txt"
#input_file = directory + "cn.txt"
#code_file = directory + "cn_code.txt"
#dict_file = directory + "cn_vocb.txt"
#os.system("subword-nmt apply-bpe -i %s -c %s -o %s" % (input_file, code_file, output_file))

After the steps above, you get the vocabulary files en_vocb.txt and cn_vocb.txt.
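The encoded corpus (en.bpe.txt) marks every split inside a word with subword-nmt's default separator `@@`, for example `low@@ er`. To sanity-check the encoding, or to turn model output back into plain words later, the markers can be stripped again. A minimal sketch, assuming the default separator was not changed; `undo_bpe` is a hypothetical helper name.

```python
import re


def undo_bpe(line, separator="@@"):
    """Join subword units back into words by removing the separator markers."""
    sep = re.escape(separator)
    # Drop "@@ " between units and a dangling "@@" at the end of the line.
    return re.sub(r"(%s )|(%s ?$)" % (sep, sep), "", line.rstrip("\n"))


print(undo_bpe("the low@@ er new@@ est"))
```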

word2vec

Next, use word2vec to train word vectors on the Chinese and English BPE corpora.
Start with English:

__author__ = 'jmh081701'
from gensim import models

dirent = ".//corpus//"
corpus = "en.bpe.txt"
vocb = "en_vocb.txt"
savepath = "en.word2vec"
option = {"unk": "UNK", "eos": "</s>"}
sentences = []
vocabulary = set()
embedding_size = 400

# Load the BPE vocabulary: the first column of each line is the token
with open(dirent + vocb, encoding="utf-8") as fp:
    lines = fp.readlines()
    for line in lines:
        word = line.split(" ")[0]
        vocabulary.add(word)

with open(dirent + corpus, encoding="utf-8") as fp:
    lines = fp.readlines()
    for line in lines:
        line = line.split(" ")
        # 1. Replace out-of-vocabulary tokens with unk
        # 2. Append an eos marker to every sentence
        for index in range(len(line)):
            line[index] = line[index].strip()
            if line[index] not in vocabulary:
                line[index] = option["unk"]
        line += [option['eos']]
        sentences.append(line)

# Train with word2vec
# Note: gensim >= 4.0 renamed size to vector_size and iter to epochs
print("#" * 30)
print("Begin Train word2vec")
word2vec = models.Word2Vec(sentences, size=embedding_size, iter=20, workers=4)
word2vec.save(dirent + savepath)

Then do the same for Chinese.
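Since the Chinese run differs only in its file names, one way to avoid copying the script is to parameterize it by language. The sketch below is an assumption about how that could look, not the author's code: the function names are hypothetical, the file naming follows the `en`/`cn` prefixes used above, and tokens are split on any whitespace, so empty tokens are dropped rather than mapped to UNK. gensim is imported inside the training function, and the keyword names are chosen by gensim version because 4.0 renamed `size`/`iter` to `vector_size`/`epochs`.

```python
def load_vocabulary(path):
    """Read the first column of a vocabulary file ("token count" per line)."""
    with open(path, encoding="utf-8") as fp:
        return {line.split()[0] for line in fp if line.strip()}


def prepare_sentences(lines, vocabulary, unk="UNK", eos="</s>"):
    """Replace tokens missing from the vocabulary with unk and append eos."""
    sentences = []
    for line in lines:
        tokens = [tok if tok in vocabulary else unk for tok in line.split()]
        sentences.append(tokens + [eos])
    return sentences


def train_word2vec(lang, dirent=".//corpus//", embedding_size=400):
    """Train and save word vectors for one language prefix ("en" or "cn")."""
    import gensim  # imported here so the helpers above work without gensim
    from gensim.models import Word2Vec

    vocabulary = load_vocabulary(dirent + lang + "_vocb.txt")
    with open(dirent + lang + ".bpe.txt", encoding="utf-8") as fp:
        sentences = prepare_sentences(fp, vocabulary)

    # Pick the keyword names that match the installed gensim version.
    if int(gensim.__version__.split(".")[0]) >= 4:
        params = {"vector_size": embedding_size, "epochs": 20}
    else:
        params = {"size": embedding_size, "iter": 20}
    model = Word2Vec(sentences, workers=4, **params)
    model.save(dirent + lang + ".word2vec")
    return model
```

Calling `train_word2vec("en")` and then `train_word2vec("cn")` would then cover both languages.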

Published: 2024-02-06 14:09:56

Article link: https://www.elefans.com/category/jswz/34/1749294.html