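[Code Notes] Continuously updated: Knowledge graphs and gensim.corpora
This is my first time working with knowledge graphs, so I'm using the KingDom code as a starting point to learn it properly~
By the time we reach the train stage, the graph features have already been extracted and saved as .np files. At this stage, those .np files are the basis for building the cross-domain knowledge base.
Getting the paths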
source_path = get_dataset_path(source_name, 'small')
target_path1 = get_dataset_path(target_name, 'small')
target_path2 = get_dataset_path(target_name, 'test')
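🔴 Building the dictionary (a mapping between each word and an int id)
Dictionary learning: finding a suitable dictionary for samples given in an ordinary dense representation, and converting those samples into a sparse representation so that the learning task is simplified and model complexity is reduced, is usually called 'dictionary learning', also known as 'sparse coding'. Note that gensim's Dictionary is something simpler: a plain token-to-id mapping, not a learned sparse-coding dictionary. (Added 22.4.5)
Class used: Dictionary (from gensim.corpora import Dictionary)
Key functions: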
1. def add_documents(self, documents, prune_at=2000000)
Updates the dictionary from a collection of `documents`, adding any new tokens it finds.
2. def doc2bow(self, document, allow_update=False, return_missing=False)
Converts `document` into bag-of-words (BoW) format.
'''
Parameters
----------
document : list of str
    Input document.
allow_update : bool, optional
    Update self, by adding new tokens from `document` and updating internal corpus statistics.
return_missing : bool, optional
    Return missing tokens (tokens present in `document` but not in self) with frequencies.

Returns
-------
list of (int, int)
    BoW representation of `document`.
list of (int, int), dict of (str, int)
    If `return_missing` is True, return BoW representation of `document` + dictionary with missing tokens and their frequencies.
'''
3. def doc2idx(self, document, unknown_word_index=-1): Converts `document` (a list of words) into a list of indexes, i.e. a list of `token_id` values. Unknown words (words not in the dictionary) are replaced with the index set via `unknown_word_index`.
4. def filter_extremes(self, no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None): Filter out tokens in the dictionary by their frequency.
??? I couldn't make sense of this one: keep_n : int, optional. Keep only the first `keep_n` most frequent tokens.
Debugging filter_extremes to find out:
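Summary: dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
1. Removes tokens that appear in fewer than no_below documents.
2. Removes tokens that appear in more than no_above of the documents. Note that this decimal is a fraction of the total number of documents, not an absolute count.
3. From what survives 1 and 2, keeps the keep_n tokens with the highest document frequency; in my debugging run, ties were resolved by id.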
Further reading: gensim usage and examples