For spaCy 2.0's xx_ent_wiki_sm model, there is a mention of the "WikiNER" dataset, which leads to the article 'Learning multilingual named entity recognition from Wikipedia'.
Is there any resource for downloading this dataset so the model can be retrained? Or a script for processing a Wikipedia dump?
The data server from Joel's (and my) former research group seems to be offline: downloads.schwa/wikiner
I found a mirror of the wp3 files here, which are the ones I'm using in spaCy: github/dice-group/FOX/tree/master/input/Wikiner
To retrain the spaCy model, you'll need to create a train/dev split (I'll put mine online for a direct comparison, but for now just take a random cut) and name the files with the .iob extension. Then use:
spacy convert -n 10 /path/to/file.iob /output/directory

The -n 10 argument is important for use in spaCy: it concatenates sentences into 'pseudo-paragraphs' of 10 sentences each. This lets the model learn that documents can contain multiple sentences.
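The random "cut" into train and dev sets mentioned above could look like this minimal Python sketch. The file names and the one-sentence-per-line reading of the wp3 files are my assumptions for illustration, not details from the original answer:

```python
import random

def train_dev_split(sentences, dev_fraction=0.1, seed=42):
    """Randomly split a list of sentences into (train, dev).

    Sketch only: assumes each element of `sentences` is one
    annotated sentence, so a flat shuffle is a valid split.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Usage (paths are illustrative, adjust to your copy of the data):
#   sentences = [line for line in open("aij-wikiner-en-wp3") if line.strip()]
#   train, dev = train_dev_split(sentences)
#   open("train.iob", "w").writelines(train)
#   open("dev.iob", "w").writelines(dev)
```

Any split strategy works here as long as train and dev don't overlap; the fixed seed just makes the cut reproducible across runs.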
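To illustrate what the -n 10 grouping does conceptually, here is a small sketch of the chunking idea. This is not spaCy's actual implementation, only the notion of joining consecutive sentences into pseudo-paragraphs of n:

```python
def group_sentences(sentences, n=10):
    """Join consecutive sentences into pseudo-paragraphs of n each,
    mimicking the effect of spacy convert's -n flag (sketch only).
    The last group may be shorter if len(sentences) % n != 0."""
    return [sentences[i:i + n] for i in range(0, len(sentences), n)]
```

With 25 sentences and n=10 this yields three pseudo-paragraphs of sizes 10, 10, and 5, so every training example shows the model a multi-sentence document rather than isolated sentences.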