Improve flow of Python classifier and combine features

Updated: 2024-10-27 12:31:29

I am trying to create a classifier to categorize websites. I am doing this for the very first time so it's all quite new to me. Currently I am trying to do some Bag of Words on a couple of parts of the web page (e.g. title, text, headings). It looks like this:

from sklearn.feature_extraction.text import CountVectorizer

countvect_text = CountVectorizer(encoding="cp1252", stop_words="english")
countvect_title = CountVectorizer(encoding="cp1252", stop_words="english")
countvect_headings = CountVectorizer(encoding="cp1252", stop_words="english")

X_tr_text_counts = countvect_text.fit_transform(tr_data['text'])
X_tr_title_counts = countvect_title.fit_transform(tr_data['title'])
X_tr_headings_counts = countvect_headings.fit_transform(tr_data['headings'])

from sklearn.feature_extraction.text import TfidfTransformer

transformer_text = TfidfTransformer(use_idf=True)
transformer_title = TfidfTransformer(use_idf=True)
transformer_headings = TfidfTransformer(use_idf=True)

X_tr_text_tfidf = transformer_text.fit_transform(X_tr_text_counts)
X_tr_title_tfidf = transformer_title.fit_transform(X_tr_title_counts)
X_tr_headings_tfidf = transformer_headings.fit_transform(X_tr_headings_counts)

from sklearn.naive_bayes import MultinomialNB

text_nb = MultinomialNB().fit(X_tr_text_tfidf, tr_data['class'])
title_nb = MultinomialNB().fit(X_tr_title_tfidf, tr_data['class'])
headings_nb = MultinomialNB().fit(X_tr_headings_tfidf, tr_data['class'])

X_te_text_counts = countvect_text.transform(te_data['text'])
X_te_title_counts = countvect_title.transform(te_data['title'])
X_te_headings_counts = countvect_headings.transform(te_data['headings'])

X_te_text_tfidf = transformer_text.transform(X_te_text_counts)
X_te_title_tfidf = transformer_title.transform(X_te_title_counts)
X_te_headings_tfidf = transformer_headings.transform(X_te_headings_counts)

accuracy_text = text_nb.score(X_te_text_tfidf, te_data['class'])
accuracy_title = title_nb.score(X_te_title_tfidf, te_data['class'])
accuracy_headings = headings_nb.score(X_te_headings_tfidf, te_data['class'])

This works fine, and I get the accuracies as expected. However, as you might have guessed, this looks cluttered and is filled with duplication. My question then is, is there a way to write this more concisely?

Additionally, I am not sure how I can combine these three features into a single multinomial classifier. I tried passing a list of tfidf values to MultinomialNB().fit(), but apparently that's not allowed.

Optionally, it would also be nice to add weights to the features, so that in the final classifier some vectors have a higher importance than others.

I'm guessing I need a Pipeline, but I'm not at all sure how I should use it in this case.

Accepted answer

First, CountVectorizer and TfidfTransformer can be replaced by TfidfVectorizer (which is essentially a combination of the two).

Second, the TfidfVectorizer and MultinomialNB can be combined in a Pipeline. A pipeline sequentially applies a list of transforms and a final estimator. When fit() is called on a Pipeline, it fits each transform one after the other and transforms the data, then fits the final estimator on the transformed data. When score() or predict() is called, it only calls transform() on all the transformers, and score() or predict() on the last step.

So the code will look like:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(encoding="cp1252", stop_words="english", use_idf=True)),
    ('nb', MultinomialNB()),
])

accuracy = {}
for item in ['text', 'title', 'headings']:
    # No need to save the return of fit(); it returns self
    pipeline.fit(tr_data[item], tr_data['class'])
    # Apply the transforms, then score with the final estimator
    accuracy[item] = pipeline.score(te_data[item], te_data['class'])

EDIT: Added the combination of all features to get a single accuracy:

To combine the results, we can follow multiple approaches. One that is easy to understand (but again a bit on the cluttered side) is the following:

# Using scipy to concatenate, because TfidfVectorizer returns sparse matrices
from scipy.sparse import hstack

def get_tfidf(tr_data, te_data, columns):
    train = None
    test = None
    tfidfVectorizer = TfidfVectorizer(encoding="cp1252", stop_words="english", use_idf=True)
    for item in columns:
        # Fit on the training column, then transform the matching test column
        # before the vectorizer is refit on the next column
        temp_train = tfidfVectorizer.fit_transform(tr_data[item])
        train = hstack((train, temp_train)) if train is not None else temp_train
        temp_test = tfidfVectorizer.transform(te_data[item])
        test = hstack((test, temp_test)) if test is not None else temp_test
    return train, test

train_tfidf, test_tfidf = get_tfidf(tr_data, te_data, ['text', 'title', 'headings'])

nb = MultinomialNB()
nb.fit(train_tfidf, tr_data['class'])
nb.score(test_tfidf, te_data['class'])

The second (and preferable) approach is to include all of this in a pipeline. But because it involves selecting different columns ('text', 'title', 'headings') and concatenating the results, it's not that straightforward. We need to use FeatureUnion for that, and specifically the following example:

http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py
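A rough sketch of that FeatureUnion approach might look as follows. Note the assumptions: `tr_data` here is a tiny made-up DataFrame standing in for the real one, the column-selector helper is something you write yourself (the linked example uses a custom `ItemSelector` class for the same job), and the `transformer_weights` values are arbitrary examples; that parameter also addresses the feature-weighting wish from the question:

```python
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative stand-in for the real training DataFrame
tr_data = pd.DataFrame({
    'text': ['cheap pills buy now', 'latest football scores tonight'],
    'title': ['pharmacy deals', 'sports news'],
    'headings': ['buy cheap', 'match report'],
    'class': ['spam', 'news'],
})

def select(name):
    # Pick one text column out of the incoming DataFrame
    return FunctionTransformer(lambda df: df[name], validate=False)

union = FeatureUnion(
    transformer_list=[
        ('text', Pipeline([('sel', select('text')), ('tfidf', TfidfVectorizer())])),
        ('title', Pipeline([('sel', select('title')), ('tfidf', TfidfVectorizer())])),
        ('headings', Pipeline([('sel', select('headings')), ('tfidf', TfidfVectorizer())])),
    ],
    # Optional: weight some parts of the page more heavily than others
    # (example values, not a recommendation)
    transformer_weights={'text': 1.0, 'title': 0.8, 'headings': 0.5},
)

model = Pipeline([('union', union), ('nb', MultinomialNB())])
model.fit(tr_data, tr_data['class'])
print(model.score(tr_data, tr_data['class']))
```

With the real data you would fit on `tr_data` and score on `te_data`, exactly as in the earlier snippets.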

Third, if you are open to using other libraries, DataFrameMapper from sklearn-pandas can simplify the FeatureUnion usage from the previous example.

If you do want to go the second or third way and run into any difficulties, please feel free to ask.

NOTE: I have not checked the code, but it should work (modulo some syntax errors, if any). I will check as soon as I'm at my PC.

Published: 2023-07-15 19:11:00
Link: https://www.elefans.com/category/jswz/34/1117672.html