How to measure the similarity of two documents, given the similarity of each pair of words?

Updated: 2024-10-15 02:33:14

Problem description

I have two documents, for example:

Doc1 = {'python','numpy','machine learning'}
Doc2 = {'python','pandas','tensorflow','svm','regression','R'}

I also know the similarity (correlation) of each pair of words, for example:

Sim('python','python') = 1
Sim('python','pandas') = 0.8
Sim('numpy', 'R') = 0.1

What is the best way to measure the similarity of the two documents?

The traditional Jaccard distance and cosine distance do not seem to be good metrics in this situation, since they do not use the pairwise word similarities.

Solution

I like a book by Peter Christen on this issue.

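Here he describes a Monge-Elkan similarity measure between two sets of strings. For each word in the first set, you find the most similar word in the second set; you then sum these best-match similarities and divide the sum by the number of elements in the first set. Note that the result depends on which set you treat as the first one, so the measure is not symmetric. You can see its description on page 30 here.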
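
The description above can be sketched in Python roughly as follows. This is a minimal illustration rather than code from the book: the helper names `monge_elkan` and `make_lookup` are made up for this example, and two assumptions are added that the question does not state, namely that a word compared with itself scores 1 and that any pair missing from the table scores 0. Averaging the two directions is just one possible way to get a symmetric score.

```python
def monge_elkan(a, b, sim):
    """Average, over the words in a, of the best similarity to any word in b."""
    a, b = list(a), list(b)
    if not a or not b:
        return 0.0  # assumption: an empty document matches nothing
    return sum(max(sim(x, y) for y in b) for x in a) / len(a)


def make_lookup(scores, default=0.0):
    """Build a word-pair similarity function from a dict keyed by (word, word)."""
    def sim(x, y):
        if x == y:
            return 1.0  # assumption: identical words are fully similar
        # The table is treated as unordered: (x, y) and (y, x) are the same pair.
        # Pairs absent from the table fall back to `default` (an assumption).
        return scores.get((x, y), scores.get((y, x), default))
    return sim


doc1 = {'python', 'numpy', 'machine learning'}
doc2 = {'python', 'pandas', 'tensorflow', 'svm', 'regression', 'R'}

# Only the pairs given in the question; a real table would cover many more.
pair_scores = {('python', 'pandas'): 0.8, ('numpy', 'R'): 0.1}
sim = make_lookup(pair_scores)

score_12 = monge_elkan(doc1, doc2, sim)   # Doc1 measured against Doc2
score_21 = monge_elkan(doc2, doc1, sim)   # Doc2 measured against Doc1
symmetric = (score_12 + score_21) / 2     # one possible symmetric variant

print(round(score_12, 4), round(score_21, 4), round(symmetric, 4))
```

With only the three similarities from the question filled in, Doc1 against Doc2 gives (1 + 0.1 + 0) / 3 and Doc2 against Doc1 gives (1 + 0.8 + 0 + 0 + 0 + 0.1) / 6, which shows how the direction changes the score.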