[Code Notes] (Continuously Updated) Knowledge Graphs: gensim.corpora

Updated: 2024-10-27 16:37:01


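This is my first time working with knowledge graphs, so I'm going to learn by working through the KingDom code~

By the time the train stage is reached, the graph features have already been extracted and saved as .np files. At this stage, a cross-domain knowledge base is built on top of those .np files.

Getting the paths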

    source_path  = get_dataset_path(source_name, 'small')
    target_path1 = get_dataset_path(target_name, 'small')
    target_path2 = get_dataset_path(target_name, 'test')

🔴 Building the dictionary (a mapping between words and int ids)

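Dictionary learning: finding a suitable dictionary for samples given in an ordinary dense representation, so that they can be converted into a suitable sparse representation. This simplifies the learning task and lowers model complexity. It is usually called "dictionary learning", also known as "sparse coding". Note that gensim's Dictionary is a different thing: it is only a token-to-id mapping and performs no sparse coding. (updated 22.4.5)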

Class used: Dictionary (from gensim.corpora import Dictionary)

Key functions:

1. def add_documents(self, documents, prune_at=2000000)

Updates the dictionary by extending it with the tokens found in `documents`.

2. def doc2bow(self, document, allow_update=False, return_missing=False)

Converts `document` into bag-of-words (BoW) format.

'''
Parameters
----------
document : list of str
    Input document.
allow_update : bool, optional
    Update self, by adding new tokens from `document` and updating internal corpus statistics.
return_missing : bool, optional
    Return missing tokens (tokens present in `document` but not in self) with frequencies?

Return
------
list of (int, int)
    BoW representation of `document`.
list of (int, int), dict of (str, int)
    If `return_missing` is True, return BoW representation of `document` + dictionary with missing tokens and their frequencies.
'''

3. def doc2idx(self, document, unknown_word_index=-1)

Convert `document` (a list of words) into a list of indexes = list of `token_id`. Replace all unknown words, i.e. words not in the dictionary, with the index set via `unknown_word_index`.

4. def filter_extremes(self, no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)

Filter out tokens in the dictionary by their frequency.

??? This part confused me: keep_n : int, optional. Keep only the first `keep_n` most frequent tokens.

After debugging:

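Summary: dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
1. Removes tokens that appear in fewer than no_below documents (a count of documents, not of total occurrences).
2. Removes tokens that appear in more than no_above of all documents. Note that this decimal is a fraction of the corpus size (0.5 means 50% of the documents), not an absolute count.
3. From what survives 1 and 2, keeps the keep_n most frequent tokens (by document frequency); when frequencies tie, the token id decided which ones were kept in my debugging.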

Further reading: gensim usage and examples


Published: 2023-06-13 04:32:48

Article link: https://www.elefans.com/category/jswz/34/674171.html