Python machine learning: sklearn text features

Last updated: 2023-04-03 19:45:47

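As mentioned in the previous article, text classification involves four steps:

Preprocessing (tokenization, stop-word removal, and so on)
Feature extraction
Feature representation
Choosing a machine learning model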
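
The preprocessing step can be sketched in plain Python as follows. This is only a minimal illustration: the `STOP_WORDS` set and the `preprocess` helper are hypothetical names invented here, and the regular expression is borrowed from the default `token_pattern` of the sklearn vectorizers discussed in this article.

```python
import re

# A tiny, hypothetical stop-word list used only for this illustration.
STOP_WORDS = {"the", "is", "a", "an", "and", "this", "of"}

def preprocess(text):
    """Lowercase the text, split it into tokens of 2+ word characters,
    and drop any token found in STOP_WORDS."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("This is the first document.")
print(tokens)
```

In practice the sklearn vectorizers perform these steps internally, so a hand-written helper like this is mainly useful for understanding what they do.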

The first class, CountVectorizer, uses word counts as features:

class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

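The input parameter selects what is passed in: 'content' for raw text, or 'filename' / 'file' for files to be read. The decode_error parameter controls what happens when the input cannot be decoded with the given encoding: the default 'strict' raises a UnicodeDecodeError, while 'ignore' skips the offending bytes. With lowercase=True (the default), the vectorizer converts all text to lowercase before tokenizing, so you do not need to lowercase the input yourself. stop_words sets the stop words to remove; 'english' uses the built-in English stop-word list. token_pattern is the regular expression that defines a token; the default matches words of two or more characters. ngram_range sets the lower and upper bounds on n for the n-grams to extract; for example, (1, 2) uses both 1-grams and 2-grams.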
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
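vectorizer = CountVectorizer()
# Learn the vocabulary and build the document-term count matrix (sparse)
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
print(X.toarray())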
The second class, TfidfVectorizer, weights the word counts by TF-IDF:

class sklearn.feature_extraction.text.TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
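vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights, then build the TF-IDF matrix (sparse)
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
print(X.toarray())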
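
To see what the defaults norm='l2', use_idf=True and smooth_idf=True do, the sketch below rebuilds the TF-IDF matrix by hand from raw counts and compares it with the output of TfidfVectorizer. It assumes the smoothed formula given in the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. Variable names such as `manual` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Raw term counts (tf), one row per document.
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears.
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (the smooth_idf=True case).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Multiply tf by idf, then scale every row to unit L2 norm (norm='l2').
raw = counts * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

tfidf = TfidfVectorizer()
expected = tfidf.fit_transform(corpus).toarray()

print(np.allclose(idf, tfidf.idf_))
print(np.allclose(manual, expected))
```

Both comparisons should agree, which shows that TfidfVectorizer with default settings is equivalent to CountVectorizer followed by IDF weighting and row normalization.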

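Text classification example

The following example extracts TF-IDF features from four categories of the 20 Newsgroups dataset, trains a Gaussian naive Bayes classifier, and uses it to classify a new document.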

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

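# Restrict the dataset to four categories and strip metadata that would leak the label
categories = ['talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories)

# Extract TF-IDF features from the training documents
vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(newsgroups_train.data)

# GaussianNB requires a dense array, so convert the sparse matrix
X = vectors.toarray()
Y = newsgroups_train.target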

# New document to classify (adjacent string literals are joined into one string)
sample_text = (
    "France, which prides itself as the global innovator of fashion, has decided its fashion industry has lost an absolute right to define physical beauty for women. "
    "Its lawmakers gave preliminary approval last week to a law that would make it a crime to employ ultra-thin models on runways. "
    "The parliament also agreed to ban websites that “incite excessive thinness” by promoting extreme dieting. "
    "Such measures have a couple of uplifting motives. They suggest beauty should not be defined by looks that end up impinging on health. That’s a start. "
    "And the ban on ultra-thin models seems to go beyond protecting models from starving themselves to death - as some have done. "
    "It tells the fashion industry that it must take responsibility for the signal it sends women, especially teenage girls, about the social tape-measure they must use to determine their individual worth. "
    "The bans, if fully enforced, would suggest to women (and many men) that they should not let others be arbiters of their beauty. "
    "And perhaps faintly, they hint that people should look to intangible qualities like character and intellect rather than dieting their way to size zero or wasp-waist physiques. "
    "The French measures, however, rely too much on severe punishment to change a culture that still regards beauty as skin-deep - and bone-showing. "
    "Under the law, using a fashion model that does not meet a government-defined index of body mass could result in a $85,000 fine and six months in prison. "
    "The fashion industry knows it has an inherent problem in focusing on material adornment and idealized body types. "
    "In Denmark, the United States, and a few other countries, it is trying to set voluntary standards for models and fashion images that rely more on peer pressure for enforcement. "
    "In contrast to France’s actions, Denmark’s fashion industry agreed last month on rules and sanctions regarding the age, health, and other characteristics of models. "
    "The newly revised Danish Fashion Ethical Charter clearly states: “We are aware of and take responsibility for the impact the fashion industry has on body ideals, especially on young people.” "
    "The charter’s main tool of enforcement is to deny access for designers and modeling agencies to Copenhagen Fashion Week, which is run by the Danish Fashion Institute. "
    "But in general it relies on a name-and-shame method of compliance. "
    "Relying on ethical persuasion rather than law to address the misuse of body ideals may be the best step. "
    "Even better would be to help elevate notions of beauty beyond the material standards of a particular industry."
)
# Reuse the fitted vectorizer (transform, not fit_transform) so the features match the training data
PX = vectorizer.transform([sample_text]).toarray()

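# Train the classifier and predict the category of the new document
gnb = GaussianNB()
gnb.fit(X, Y)
y_pred = gnb.predict(PX)
# Map the predicted class index back to its newsgroup name
print(newsgroups_train.target_names[y_pred[0]])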
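
As a possible variation on the example above, the vectorizer and the classifier can be chained in a single Pipeline. The sketch below swaps in MultinomialNB, which accepts the sparse TF-IDF matrix directly so no toarray() call is needed; this is a substitution, not the classifier used above. The tiny corpus and its "sports" / "tech" labels are made up for illustration so the snippet runs without downloading 20 Newsgroups.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, invented for this sketch.
train_docs = [
    "the team won the football match",
    "a great goal in the final game",
    "the new phone has a fast processor",
    "software update improves battery life",
]
train_labels = ["sports", "sports", "tech", "tech"]

# The pipeline vectorizes the text and then feeds the sparse matrix to the classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_docs, train_labels)

# predict() takes raw strings; the fitted vectorizer is applied automatically.
prediction = model.predict(["the match ended with a late goal"])
print(prediction[0])
```

Keeping both steps in one object also ensures that new documents are always transformed with the vocabulary learned during training.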