Cosine Similarity and LDA Topics

Updated: 2024-10-12 01:22:26

Question

I want to compute the cosine similarity between LDA topics. The gensim function gensim.matutils.cossim can do this, but I don't know which arguments (vectors) to pass to it.

Here is the code:

import numpy as np
import lda
from sklearn.feature_extraction.text import CountVectorizer

cvectorizer = CountVectorizer(min_df=4, max_features=10000, stop_words='english')
cvz = cvectorizer.fit_transform(tweet_texts_processed)

n_topics = 8
n_iter = 500
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)

n_top_words = 6
topic_summaries = []

topic_word = lda_model.topic_word_  # get the topic words
vocab = cvectorizer.get_feature_names()
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

doc_topic = lda_model.doc_topic_
lda_keys = []
for i, tweet in enumerate(tweets):
    lda_keys += [X_topics[i].argmax()]

import gensim
from gensim import corpora, models, similarities

#Cosine Similarity between LDA topics
sim = gensim.matutils.cossim(LDA_topic[1], LDA_topic[2])

Recommended answer

You can use the word-topic distribution vectors. Both topic vectors need to have the same dimension, and each must be a list of tuples in which the first element is an int and the second is a float.

vec1 (list of (int, float))

The first element is the word_id, which you can find in the model's id2word variable. If you have two models, you need to take the union of their dictionaries. Your vectors must look like this:

[(1, 0.541223), (2, 0.44123)]
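As a rough illustration of this format, the sketch below turns a dense topic-word row into a list of (word_id, weight) tuples and re-indexes two models' topics into one shared id space. The helper names (to_sparse, merge_vocab, remap) and the toy vocabularies are hypothetical and chosen only for illustration; they are not part of gensim or the lda package.

```python
def to_sparse(topic_dist):
    """Turn a dense topic-word row into [(word_id, weight), ...]."""
    return [(int(i), float(p)) for i, p in enumerate(topic_dist)]


def merge_vocab(vocab_a, vocab_b):
    """Assign every word from both vocabularies an id in one shared space."""
    shared = {}
    for word in list(vocab_a) + list(vocab_b):
        shared.setdefault(word, len(shared))
    return shared


def remap(topic_dist, vocab, shared):
    """Re-index one topic row from its own model's ids to the shared ids."""
    return sorted((shared[word], float(p)) for word, p in zip(vocab, topic_dist))


# Toy data (assumed): two vocabularies and one topic row per model.
vocab_a = ["cat", "dog", "fish"]
vocab_b = ["dog", "bird"]
topic_a = [0.5, 0.3, 0.2]
topic_b = [0.6, 0.4]

shared = merge_vocab(vocab_a, vocab_b)
vec_a = remap(topic_a, vocab_a, shared)
vec_b = remap(topic_b, vocab_b, shared)

print(to_sparse(topic_a))
print(vec_a)
print(vec_b)
```

When both topics come from a single model, as in the question, the row index already serves as the word id, so to_sparse alone should be enough.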

Then you can compare them.
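The comparison step can be sketched as follows. The cosine_sparse helper is a hypothetical plain-Python stand-in, written so the example runs without extra dependencies; with gensim installed, gensim.matutils.cossim(vec1, vec2) accepts the same list-of-tuples arguments. The topic_word matrix is made-up data standing in for lda_model.topic_word_ from the question.

```python
import math


def cosine_sparse(vec1, vec2):
    """Cosine similarity between two [(word_id, weight), ...] vectors."""
    d1, d2 = dict(vec1), dict(vec2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an all-zero vector has no direction
    dot = sum(w * d2.get(i, 0.0) for i, w in d1.items())
    return dot / (norm1 * norm2)


# Assumed toy stand-in for lda_model.topic_word_ (3 topics, 4 words).
topic_word = [
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
]

# Each row becomes a list of (word_id, weight) tuples.
vectors = [[(i, float(p)) for i, p in enumerate(row)] for row in topic_word]

# Compare every pair of topics once.
for a in range(len(vectors)):
    for b in range(a + 1, len(vectors)):
        sim = cosine_sparse(vectors[a], vectors[b])
        print('Topic {} vs Topic {}: {:.3f}'.format(a, b, sim))
```

Topics that put their weight on the same words score close to 1, while topics with little overlap score close to 0.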

Published: 2023-10-18 04:19:48

Article link: https://www.elefans.com/category/jswz/34/1503081.html