TF-IDF representation for the MovieLens (ML) tag dataset in COO format (Python)

Updated: 2024-10-18 12:35:14

I would like to obtain the tf-idf representation for the MovieLens tag dataset. The tags are stored in a 'coo'-like format, one (userId, movieId, tag) entry per row:

    import pandas as pd

    ratings = pd.read_csv('data/ratings.csv', sep=',')
    movies = pd.read_csv('data/movies.csv', sep=',')
    tags = pd.read_csv('data/tags.csv', sep=',')
    print(tags)

        userId  movieId                      tag  \
    0       15      339  sandra 'boring' bullock
    1       15     1955                  dentist
    2       15     7478                 Cambodia
    3       15    32892                  Russian
    4       15    34162              forgettable
    5       15    35957                    short
    6       15    37729               dull story
    7       15    45950               powerpoint
    8       15   100365                 activist
    9       15   100365              documentary
    10      15   100365                   uganda
    11      23      150               Ron Howard
    ...

The first version of my tf-idf code looks like this:

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(use_idf=True, norm='l2')
    X = vectorizer.fit_transform(tags['tag'])
    print(X)

    (0, 89)     0.603928505945
    (0, 80)     0.52013528953
    (0, 577)    0.603928505945
    (1, 160)    1.0
    (2, 94)     1.0
    (3, 573)    1.0
    (4, 255)    1.0
    (5, 604)    1.0
    ...

While this looks nice, it is not the exact representation that I want. There are two main problems:

1. I think each row of the 'tags' table is treated as a separate document, which is not what I want. Many movies are tagged by several users, and each tag is stored as a separate entry.
2. The ids in 'X' are matrix indices. How can I find the corresponding ML-ids? Suppose I want the tf-idf representation for the movie with MLid 150. How can I look it up?

It would be nice if you could let me know how to fix the two problems above, which I think should be quite easy.

Best answer

Input

    userId  movieId  tag
    15      339      sandra 'boring' bullock
    15      1955     dentist
    15      7478     Cambodia
    15      32892    Russian
    15      34162    forgettable
    15      35957    short
    15      37729    dull story
    15      45950    powerpoint
    15      100365   activist
    15      100365   documentary
    15      100365   uganda
    23      150      Ron Howard

Code

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Consolidate the dataset: join all tags of a movie into one document
    tags = pd.read_csv('tfidf_input1.csv')
    concatenated_tags = tags.groupby('movieId')['tag'].apply(' '.join).reset_index()
    # print(concatenated_tags)

    # TF-IDF vectorization: row i of X belongs to concatenated_tags['movieId'][i]
    vec = TfidfVectorizer()
    X = vec.fit_transform(concatenated_tags['tag'])
    # print(X)

    # To inspect the ids in the tf-idf matrix you can convert it to dense
    # [NOT AT ALL advised for large matrices]; the output is a compressed
    # sparse matrix for memory reasons.
    X_dense = X.todense()
    print(vec.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
    print(X_dense[0, :])  # output for the first movieId

Published: 2023-08-05 04:21:00
Article link: https://www.elefans.com/category/jswz/34/1428115.html