Training a Custom Chinese Part-of-Speech Tagging Model with spaCy

Last updated: 2024-10-24 18:17:02


2021/4/14

First, import the required packages and read in the data files:

# Import the required packages
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
from spacy.training import Example
import jieba

# Read the training and test data.
# Strip newlines and sentence-final punctuation.
def clean_str(str1):
    result = str1
    for i in ['\n', '；', '。']:
        result = result.replace(i, '')
    return result

with open('test.txt', 'r', encoding='utf-8') as f:
    test = [clean_str(line) for line in f.readlines()]
with open('train.txt', 'r', encoding='utf-8') as f:
    train = [clean_str(line) for line in f.readlines()]

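Word segmentation is done with the tokenizer of spaCy's Chinese pipeline. Proper nouns such as 苹果公司 (Apple Inc.) have to be added to the user dictionary separately so that they are kept as single tokens.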

nlp1 = spacy.load('zh_core_web_sm')
proper_nouns = ['苹果公司']
# Add the proper nouns to the pkuseg user dictionary
nlp1.tokenizer.pkuseg_update_user_dict(proper_nouns)

def word_split(string):
    # Segment a sentence and join its tokens with '/'
    doc1 = nlp1(string)
    return '/'.join([t.text for t in doc1])

train_word_split = [word_split(t) for t in train]
test_word_split = [word_split(t) for t in test]
print('Training set segmentation:', train_word_split, '\n')
print('Test set segmentation:', test_word_split)
print('------------------------------------------------------------------------------------')

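The segmentation results are printed for inspection; the actual output is reproduced in the annotation step below.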

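First define the tag set. The categories considered here are subject, predicate, object, adjective, numeral, adverb, particle, and preposition. Subject, predicate, and object are syntactic roles rather than parts of speech, so they are mapped to the parts of speech that typically fill them.

The correspondence is: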

Category	spaCy POS
Subject	NOUN / PRON
Predicate	VERB
Object	NOUN / PRON
Pronoun	PRON
Adjective	ADJ
Numeral	NUM
Adverb	ADV
Particle / preposition	PART

Build a dictionary that maps each one-letter tag to its part of speech:

TAG_MAP = {
    'N': {'pos': 'NOUN'},
    'V': {'pos': 'VERB'},
    'J': {'pos': 'ADJ'},
    'P': {'pos': 'PRON'},
    'M': {'pos': 'NUM'},
    'A': {'pos': 'ADV'},
    'R': {'pos': 'PART'}
}
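The one-letter tags are compact but hard to read at a glance. As a minimal sketch, the snippet below expands a label string into POS names using the same dictionary. The helper name `expand_tags` is hypothetical and is not part of the original code.

```python
# TAG_MAP as defined in the article
TAG_MAP = {
    'N': {'pos': 'NOUN'},
    'V': {'pos': 'VERB'},
    'J': {'pos': 'ADJ'},
    'P': {'pos': 'PRON'},
    'M': {'pos': 'NUM'},
    'A': {'pos': 'ADV'},
    'R': {'pos': 'PART'},
}

def expand_tags(label_string):
    """Map each one-letter tag to its POS name (hypothetical helper)."""
    return [TAG_MAP[letter]['pos'] for letter in label_string]

# Labels for the sentence 我/有/一个/梦想
print(expand_tags('PVMN'))  # ['PRON', 'VERB', 'NUM', 'NOUN']
```

An unknown letter raises a `KeyError`, which is a convenient way to catch typos in hand-written label strings.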

Annotate the segmentation results, assigning one letter from TAG_MAP to each token:

Training set segmentation: ['我/有/一个/梦想', '我/有/一个/苹果', '很多/人/都/喜欢/吃/苹果', '苹果公司/的/市值/最/高/的', '苹果/的/营养/是/最好/的', '库克/是/苹果公司/的/CEO', '陕西/是/苹果/的/原产地/之一', '今天/天气/真/好', '不/想/工作']

Test set segmentation: ['我/喜欢/吃/苹果', '市值/最/高/的/公司/是/苹果', '库克/是/一个/很/好的/CEO', '我/因为/今天/的/天气/不/想/工作']
------------------------------------------------------------------------------------

Training set:

PVMN，PVMN，JNAAVN，NRNAVR，NRNVJR，NVNRN，NVNRNR，JNAJ，AVN

Test set:

PAVN，NAJRNVN，NVMAJN，PAJRNAVN

Convert the labels above into tag lists:

# POS labels: one letter per token, sentences separated by full-width commas
train_bz = 'PVMN，PVMN，JNAAVN，NRNAVR，NRNVJR，NVNRN，NVNRNR，JNAJ，AVN'
test_bz = 'PAVN，NAJRNVN，NVMAJN，PAJRNAVN'
# Convert them into per-sentence tag strings
train_bz_list = train_bz.split('，')
test_bz_list = test_bz.split('，')

TRAIN_DATA = []
for i in range(len(train)):
    TRAIN_DATA.append((train[i], {"tags": list(train_bz_list[i])}))
print(TRAIN_DATA)
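Because the labels are written by hand, it is worth checking that every sentence has exactly one tag per token before training. The following is a minimal sketch using the segmentation output and label strings shown above. The function name `check_alignment` is an assumption made for illustration.

```python
# Segmented training sentences and labels, copied from the article
train_word_split = [
    '我/有/一个/梦想', '我/有/一个/苹果', '很多/人/都/喜欢/吃/苹果',
    '苹果公司/的/市值/最/高/的', '苹果/的/营养/是/最好/的',
    '库克/是/苹果公司/的/CEO', '陕西/是/苹果/的/原产地/之一',
    '今天/天气/真/好', '不/想/工作',
]
train_bz = 'PVMN，PVMN，JNAAVN，NRNAVR，NRNVJR，NVNRN，NVNRNR，JNAJ，AVN'

def check_alignment(segmented, labels):
    """Return (index, n_tokens, n_tags) for each sentence whose counts differ."""
    mismatches = []
    for i, (sentence, tags) in enumerate(zip(segmented, labels)):
        n_tokens = len(sentence.split('/'))
        if n_tokens != len(tags):
            mismatches.append((i, n_tokens, len(tags)))
    return mismatches

train_bz_list = train_bz.split('，')
# One label string per sentence
assert len(train_bz_list) == len(train_word_split)
mismatches = check_alignment(train_word_split, train_bz_list)
print(mismatches)  # [] means every sentence is aligned
```

An empty list means the annotations line up with the word-level segmentation. Any entry points to a sentence that would trigger a length error during training.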

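Write the command-line annotations and the training routine:

# Additional imports needed by this step
import time
from spacy.tokens import Doc

def tag_words(pipeline, segmented):
    # Build a Doc from pre-segmented words ('/'-joined) and run the pipeline on it
    doc = Doc(pipeline.vocab, words=segmented.split('/'))
    for _, proc in pipeline.pipeline:
        doc = proc(doc)
    return doc

# plac annotations: expose the parameters as command-line options
@plac.annotations(
    lang=("ISO Code of language to use", "option", "l", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))
# Train the model and test it
def main(lang='zh', output_dir='mymodel', n_iter=25):
    nlp = spacy.blank(lang)  # create a blank pipeline; 'zh' is Chinese
    tagger = nlp.add_pipe('tagger')
    # Register the labels
    for tag, values in TAG_MAP.items():
        print("tag:", tag)
        print("values:", values)
        tagger.add_label(tag)
    print("3:", tagger)

    # Train the model
    optimizer = nlp.initialize()  # initialize the model weights
    for i in range(n_iter):
        random.shuffle(TRAIN_DATA)  # shuffle the training examples
        losses = {}
        for text, annotations in TRAIN_DATA:
            # Use the pre-segmented words so the token count matches the tag count
            words = word_split(text).split('/')
            example = Example.from_dict(Doc(nlp.vocab, words=words), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
        print(losses)

    # Test the trained model on two ad-hoc sentences
    print('Ad-hoc input test --------------------------------------------------------------------')
    time.sleep(2)
    string1 = "库克经理是苹果公司的老板拥有百万财产"
    string2 = "学校的一点点奶茶店快开张了"
    doc1 = tag_words(nlp, word_split(string1))
    doc2 = tag_words(nlp, word_split(string2))
    print('Tags', [(t.text, t.tag_, t.pos_) for t in doc1])
    print('Tags', [(t.text, t.tag_, t.pos_) for t in doc2])

    # Save the model to the output directory
    print('Test set run --------------------------------------------------------------------')
    time.sleep(2)
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Model saved to:", output_dir)

        # Reload the saved model and run it on the test data
        print("Loading model from:", output_dir)
        nlp2 = spacy.load(output_dir)
        for text in test_word_split:
            doc = tag_words(nlp2, text)
            print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])

if __name__ == '__main__':
    plac.call(main)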

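A blank Chinese pipeline has no word segmentation capability. Training it the same way as a blank English pipeline, by passing raw text, leads to a length mismatch.

The following error is raised: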

ValueError: [E971] Found incompatible lengths in `Doc.from_array`: 4 for the array and 6 for the Doc itself.

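This means that the number of tokens in the text differs from the number of tags. The numbers in the message vary between runs because the training data is shuffled.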
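The two numbers in the message can be reproduced without spaCy. The sketch below assumes that the blank Chinese pipeline falls back to character-level segmentation (spaCy's default segmenter for Chinese), so a sentence annotated with four word-level tags becomes six single-character tokens.

```python
sentence = '我有一个梦想'
tags = list('PVMN')  # four word-level tags from the training labels

# Assumption: a blank Chinese pipeline splits the text into single characters
char_tokens = list(sentence)
# Word-level segmentation produced by the pretrained pipeline
word_tokens = '我/有/一个/梦想'.split('/')

print(len(tags), len(char_tokens), len(word_tokens))  # 4 6 4
```

Four tags against six character tokens is exactly the "4 for the array and 6 for the Doc" pair in the error, while the word-level segmentation matches the tag count.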

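Since the goal of this experiment is to train the tagger, segmentation is still delegated to a pretrained zh_core_web model (zh_core_web_md-3.0.0-py3-none-any; the code above loads the smaller zh_core_web_sm).

A blank English pipeline tokenizes directly on whitespace. For Chinese, the training and test sets can first be segmented with the pretrained segmentation model, and the POS tagger is then trained on the segmented text.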

Two sentences were used as input:

    string1 = "库克经理是苹果公司的老板拥有百万财产"
    string2 = "学校的一点点奶茶店快开张了"

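The tagging results are as follows. The third field (pos_) is empty because in spaCy v3 the tagger only sets the tag_ attribute; the 'pos' values in TAG_MAP are not applied to the tokens.

Tags [('库克', 'N', ''), ('经理', 'N', ''), ('是', 'V', ''), ('苹果公司', 'N', ''), ('的', 'R', ''), ('老板', 'N', ''), ('拥有', 'N', ''), ('百万', 'N', ''), ('财产', 'N', '')]
Tags [('学校', 'N', ''), ('的', 'R', ''), ('一点点', 'N', ''), ('奶茶', 'N', ''), ('店', 'V', ''), ('快', 'J', ''), ('开张', 'N', ''), ('了', 'N', '')]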

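Because an output path is set, the trained model can be saved to disk.

Loading the saved model and running it on the segmented test set prints the tags for each test sentence.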
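The gold labels for the test set are defined earlier but never compared with the model output. As a minimal sketch, token-level accuracy could be computed as below. The function `tag_accuracy` is hypothetical, and the `predicted` list is made-up placeholder data for illustration only, not output from the trained model.

```python
def tag_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for g, p in zip(gold, predicted):
        if len(g) != len(p):
            raise ValueError('gold and predicted tag sequences differ in length')
        correct += sum(a == b for a, b in zip(g, p))
        total += len(g)
    return correct / total if total else 0.0

# Gold test labels from the article
gold = 'PAVN，NAJRNVN，NVMAJN，PAJRNAVN'.split('，')
# Hypothetical predictions (placeholder values, NOT results from the article)
predicted = ['PNVN', 'NAJRNVN', 'NVMNNN', 'PNNRNAVN']

print(f'{tag_accuracy(gold, predicted):.2f}')  # 0.80 for these placeholder values
```

With real predictions, each string in `predicted` would be built by joining `t.tag_` over the tokens of the corresponding tagged test sentence.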


Published: 2024-03-10 18:07:54

Article link: https://www.elefans.com/category/jswz/34/1728669.html