Cosine similarity yields 'nan' values

Updated: 2024-10-12 05:49:23

Problem description

I am computing a cosine similarity matrix for sparse vectors, and the elements that should be floats turn out to be 'nan'.

'visits' is a sparse matrix that records how many times each user has visited each website. It originally had shape 1 500 000 x 1500, and I converted it to a sparse matrix with coo_matrix().tocsc().

The task is to find out how similar the websites are, so I decided to compute the cosine metric between every pair of sites.

Here is my code:

    cosine_distance_matrix = np.ndarray(shape = (visits.shape[1], visits.shape[1]))

    def norm(x):
        return np.sqrt( x.T.dot(x) )

    for i in range(0, visits.shape[1]):
        for k in range(0, i + 1):
            normi_normk = norm(visits[:,i]) * norm(visits[:,k])
            cosine_distance_matrix[i,k] = visits[:,i].T.dot(visits[:, k])/normi_normk
            cosine_distance_matrix[k, i] = cosine_distance_matrix[i, k]

    print cosine_distance_matrix

And this is what I got! O_o

    [[  1.  nan  nan ...,  nan  nan  nan]
     [ nan   1.  nan ...,  nan  nan  nan]
     [ nan  nan   1. ...,  nan  nan  nan]
     ...,
     [ nan  nan  nan ...,   1.  nan  nan]
     [ nan  nan  nan ...,  nan   1.  nan]
     [ nan  nan  nan ...,  nan  nan   1.]]

The program ran for 3 hours... Why do I get this garbage instead of float numbers?

Recommended answer

Try:

    def norm(x):
        return np.sqrt((x.T*x).A)

I constructed a smaller sample visits matrix and computed cosine_distance_matrix with your code. Mine had a diagonal of 1s and lots of nan off the diagonal. I chose one of the nan entries and looked at the corresponding i,k calculation.

    In [690]: normi_normk = norm(visits[:,i]) * norm(visits[:,k])

    In [691]: normi_normk
    Out[691]:
    <1x1 sparse matrix of type '<class 'numpy.float64'>'
        with 1 stored elements in Compressed Sparse Column format>

    In [692]: normi_normk.A
    Out[692]: array([[ 18707.57953344]])

visits is a sparse matrix, so visits[:,i] is also a sparse matrix (1 column). Your norm function therefore returns a 1x1 sparse matrix, not a number.
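
To make the point above concrete, here is a minimal sketch on a made-up 4 x 3 visits matrix (the data and the variable names col, gram and norm0 are illustrative assumptions, not from the question). It checks that both the column slice and its dot product stay sparse, and shows one way to get a plain scalar norm. It uses .toarray(), the method form of the .A shorthand used in the answer.

```python
import numpy as np
from scipy.sparse import coo_matrix, issparse

# Hypothetical toy data: 4 users (rows) x 3 websites (columns).
dense = np.array([[2, 0, 1],
                  [0, 3, 0],
                  [1, 0, 0],
                  [0, 1, 0]])
visits = coo_matrix(dense).tocsc()

col = visits[:, 0]      # still a sparse matrix, shape (4, 1)
gram = col.T.dot(col)   # a 1x1 sparse matrix, not a number

def norm(x):
    # Densify the 1x1 product, then index [0, 0] to get a plain scalar.
    return np.sqrt(x.T.dot(x).toarray()[0, 0])

col_is_sparse = issparse(col)
gram_is_sparse = issparse(gram)
norm0 = norm(col)
print(col_is_sparse, gram_is_sparse, gram.shape, norm0)
```

Returning a scalar rather than a 1x1 array keeps every later division an ordinary float division.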

For this pair the dot product is 0, but it is still a 1x1 sparse matrix:

    In [718]: visits[:,i].T.dot(visits[:, k])
    Out[718]:
    <1x1 sparse matrix of type '<class 'numpy.int32'>'
        with 0 stored elements in Compressed Sparse Column format>

Dividing one of these sparse matrices by the other gives nan:

    In [717]: visits[:,i].T.dot(visits[:, k])/normi_normk
    Out[717]: matrix([[ nan]])

But if I change normi_normk to a scalar or dense array, I get 0:

    In [722]: visits[:,i].T.dot(visits[:, k])/normi_normk.A
    Out[722]: matrix([[ 0.]])

So we have to change this from a sparse matrix/sparse matrix division into something that involves dense arrays or scalars. There are several ways to do that. Rewriting norm so that it handles sparse matrices correctly is one of them.

Also, I'd suggest using:

    (visits[:,i].T*visits[:, k]).A/normi_normk

so that both terms of the division are dense.

Another possibility is to use visits[:,i].A and visits[:,k].A, so that the inner loop calculations are done with dense arrays rather than these sparse matrices.

Note that I'm not doing anything advanced or special. I just examined one of the problem calculations in detail and found the source of the nan.

I would also suggest using np.zeros to initialize the array. I only use np.ndarray directly when the usual zeros, ones, and empty don't work.

    cosine_distance_matrix = np.zeros((visits.shape[1], visits.shape[1]))
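
Putting these suggestions together, the original loop might be reworked roughly as follows. This is only a sketch on a made-up 4 x 3 matrix. The helper dot and the names n_sites, norms and cosine_matrix are illustrative assumptions, and the norms are precomputed once per column, which the question's code did not do. It also assumes no website column is all zeros.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy data: 4 users (rows) x 3 websites (columns).
visits = coo_matrix(np.array([[2, 0, 1],
                              [0, 3, 0],
                              [1, 0, 0],
                              [0, 1, 0]], dtype=float)).tocsc()

def dot(a, b):
    # Densify the 1x1 sparse product and pull out a plain float.
    return a.T.dot(b).toarray()[0, 0]

n_sites = visits.shape[1]
cosine_matrix = np.zeros((n_sites, n_sites))

# One norm per column, computed once instead of inside the double loop.
norms = [np.sqrt(dot(visits[:, i], visits[:, i])) for i in range(n_sites)]

for i in range(n_sites):
    for k in range(i + 1):
        # Scalar / scalar division. Assumes no all-zero column,
        # since a zero norm would give 0/0 here.
        cosine_matrix[i, k] = dot(visits[:, i], visits[:, k]) / (norms[i] * norms[k])
        cosine_matrix[k, i] = cosine_matrix[i, k]

print(cosine_matrix)
```

Because every operand of the division is a plain float, a pair of sites with no users in common gets 0 rather than nan.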

In the big picture, it would be best to avoid looping over i and k and do everything with matrix products and the like. But this fix will get you going.
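
As a rough illustration of the loop-free approach, the sketch below computes all pairwise dot products with a single sparse product and then normalises with dense arrays. The function name cosine_similarity_columns and the toy data are assumptions made for this example. Reporting 0 for any pair that involves an all-zero column is also a choice made here, not something stated in the answer.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy data: 4 users (rows) x 3 websites (columns).
visits = coo_matrix(np.array([[2, 0, 1],
                              [0, 3, 0],
                              [1, 0, 0],
                              [0, 1, 0]], dtype=float)).tocsc()

def cosine_similarity_columns(m):
    # Gram matrix of the columns: entry (i, k) is the dot product of
    # columns i and k. It is n_sites x n_sites, so densifying is cheap.
    gram = m.T.dot(m).toarray()
    norms = np.sqrt(np.diag(gram))
    denom = np.outer(norms, norms)
    # Assumption: pairs involving an all-zero column are reported as 0.
    out = np.zeros_like(gram, dtype=float)
    np.divide(gram, denom, out=out, where=denom > 0)
    return out

similarity = cosine_similarity_columns(visits)
print(similarity)
```

For the shape in the question this needs one sparse product and a dense 1500 x 1500 array, instead of about a million small sparse operations in Python.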

Published: 2023-10-18 04:19:35
Link: https://www.elefans.com/category/jswz/34/1503079.html