TF-IDF representation for ML dataset in coo format (Python)
I would like to obtain the tf-idf representation for the MovieLens tag dataset. The tags are in a 'coo' format:

    import pandas as pd

    ratings = pd.read_csv('data/ratings.csv', sep=',')
    movies = pd.read_csv('data/movies.csv', sep=',')
    tags = pd.read_csv('data/tags.csv', sep=',')
    print(tags)

        userId  movieId                      tag  \
    0       15      339  sandra 'boring' bullock
    1       15     1955                  dentist
    2       15     7478                 Cambodia
    3       15    32892                  Russian
    4       15    34162              forgettable
    5       15    35957                    short
    6       15    37729               dull story
    7       15    45950               powerpoint
    8       15   100365                 activist
    9       15   100365              documentary
    10      15   100365                   uganda
    11      23      150               Ron Howard
    ...

The first version of my tf-idf code looks like this:

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(use_idf=True, norm='l2')
    X = vectorizer.fit_transform(tags['tag'])
    print(X)

    (0, 89)     0.603928505945
    (0, 80)     0.52013528953
    (0, 577)    0.603928505945
    (1, 160)    1.0
    (2, 94)     1.0
    (3, 573)    1.0
    (4, 255)    1.0
    (5, 604)    1.0
    ...

While this looks nice, it is not exactly the representation that I want. There are two main problems:

1. I think each line in the 'tag' matrix is treated as one document, which is not true. Many movies are tagged by different users, and each tag is added as a separate entry.
2. The ids in 'X' are matrix indices. How can I find the corresponding ML-ids? Suppose I want the tf-idf representation for the movie with MLid 150. How can I look this up?

It would be nice if you could let me know how I can fix the above cases, which I think should be quite an easy task.

Best answer

Input

    userId  movieId  tag
    15      339      sandra 'boring' bullock
    15      1955     dentist
    15      7478     Cambodia
    15      32892    Russian
    15      34162    forgettable
    15      35957    short
    15      37729    dull story
    15      45950    powerpoint
    15      100365   activist
    15      100365   documentary
    15      100365   uganda
    23      150      Ron Howard

Code

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Consolidate the dataset: join all tags of a movie into one document,
    # so that each movieId (not each tag entry) is one row
    tags = pd.read_csv('tfidf_input1.csv')
    concatenated_tags = tags.groupby('movieId')['tag'].apply(' '.join).reset_index()
    # print(concatenated_tags)

    # TF-IDF vectorization: row i of X belongs to concatenated_tags['movieId'][i]
    vec = TfidfVectorizer()
    X = vec.fit_transform(concatenated_tags['tag'])
    # print(X)

    # To read the weights against the term names, convert to dense
    # [NOT AT ALL advised for large matrices];
    # the output is a compressed sparse matrix to save memory
    X_dense = X.toarray()
    print(vec.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
    print(X_dense[0, :])  # output for the first movieId