So I am using the TextBlob Python library, but the performance is lacking.
I already serialize the classifier and load it before the loop (using pickle).
It currently takes ~0.1 (for the small training data) and ~0.3 on the 33'000 test data. I need to make it faster; is it even possible?
```python
# Pass trainings before the loop, so we can make performance a lot better
trained_text_classifiers = load_serialized_classifier_trainings(config["ALL_CLASSIFICATORS"])

# Specify which classifiers are used by which classes
filter_classifiers = get_classifiers_by_resource_names(trained_text_classifiers, config["FILTER_CLASSIFICATORS"])
signal_classifiers = get_classifiers_by_resource_names(trained_text_classifiers, config["SIGNAL_CLASSIFICATORS"])

for (url, headers, body) in iter_warc_records(warc_file, **warc_filters):
    start_time = time.time()
    body_text = strip_html(body)

    # Check if the url body passes the filters; if yes, index, if no, ignore
    if Filter.is_valid(body_text, filter_classifiers):
        print "Indexing", url.url
        resp = indexer.index_document(body, body_text, signal_classifiers, url=url,
                                      headers=headers, links=bool(args.save_linkgraph_domains))
    else:
        print "\n"
        print "Filtered out", url.url
        print "\n"
        resp = 0
```

This is the loop which performs the checks on each of the WARC file's bodies and metadata.
There are 2 text classification checks here.
1) In Filter (very small training data):

```python
if trained_text_classifiers.classify(body_text) == "True":
    return True
else:
    return False
```
2) In index_document (33'000 training data):

```python
prob_dist = trained_text_classifier.prob_classify(body)
prob_dist.max()
# Return the probability of spam
return round(prob_dist.prob("spam"), 2)
```

classify and prob_classify are the methods that take the toll on performance.
Answer:

You can use feature selection on your data. Good feature selection can reduce the feature set by up to 90% while preserving classification performance. In feature selection you select the top features (in a Bag-of-Words model, the most influential words) and train the model on those words (features). This reduces the dimensionality of your data (and also helps avoid the curse of dimensionality). Here is a good survey: Survey on feature selection
In brief:

Two feature selection approaches are available: filtering and wrapping.
The filtering approach is mostly based on information theory. Search for "mutual information", "chi2", and so on for this type of feature selection.
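As a concrete illustration of the filtering approach, here is a minimal sketch of chi2-based selection with scikit-learn's SelectKBest; the toy documents and labels are made up for the example, not taken from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Illustrative toy corpus with spam/ham labels
docs = ["cheap pills buy now", "meeting agenda attached",
        "win money fast", "project status update"]
labels = ["spam", "ham", "spam", "ham"]

X = CountVectorizer().fit_transform(docs)  # Bag-of-Words matrix, one column per word

# Keep only the 4 words whose chi2 score against the labels is highest
selector = SelectKBest(chi2, k=4)
X_reduced = selector.fit_transform(X, labels)

print(X_reduced.shape)  # fewer columns than the full vocabulary
```

The classifier is then trained on `X_reduced` instead of the full vocabulary, which is what makes both training and classification faster.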
The wrapping approach uses the classification algorithm itself to estimate the most important features. For example, you select some words and evaluate the classification performance (recall, precision).
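One common wrapper technique is recursive feature elimination (RFE), which repeatedly fits a classifier and drops the weakest features. This is a minimal sketch with scikit-learn; the toy documents and labels are illustrative, not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Illustrative toy corpus with spam/ham labels
docs = ["cheap pills buy now", "meeting agenda attached",
        "win money fast", "project status update"]
labels = ["spam", "ham", "spam", "ham"]

X = CountVectorizer().fit_transform(docs).toarray()

# RFE refits the model while eliminating features one by one,
# so the selection is driven by the classifier (wrapper approach)
rfe = RFE(LogisticRegression(), n_features_to_select=4)
X_reduced = rfe.fit_transform(X, labels)

print(X_reduced.shape)  # 4 documents x 4 selected word features
```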
Some other approaches can also be useful. LSA and LSI can improve both classification performance and runtime: en.wikipedia/wiki/Latent_semantic_analysis
You can use scikit-learn for feature selection and LSA:
scikit-learn/stable/modules/feature_selection.html
scikit-learn/stable/modules/decomposition.html
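For the LSA route, a minimal sketch with scikit-learn's TruncatedSVD (the standard LSA implementation there); the toy documents are illustrative. Projecting the high-dimensional word vectors onto a few latent components shrinks the input the classifier has to handle.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Illustrative toy corpus
docs = ["cheap pills buy now", "meeting agenda attached",
        "win money fast", "project status update"]

X = TfidfVectorizer().fit_transform(docs)  # sparse, one column per word

# Keep 2 latent "topic" components instead of the full vocabulary
svd = TruncatedSVD(n_components=2)
X_lsa = svd.fit_transform(X)               # dense, 2 columns per document

print(X_lsa.shape)  # (4, 2)
```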