http://scikit-learn/stable/modules/feature_extraction.html
Writing this while sick, in an internet cafe... your support is appreciated...
1、First, let's clear up two concepts: feature extraction and feature selection. Feature extraction is very different from feature selection: the former consists in transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied on these features (that is, picking the better features out of those already extracted).
The rest is split into four parts, the most important being 4、text feature extraction.
2、loading features from dicts
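The class here is DictVectorizer, which turns a list of dicts (feature name to value) into a numeric matrix. It is easiest to understand from a single example.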
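PoS (part-of-speech) features, for example a dict holding the words and tags in a window around a token, can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for normalization).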
3、feature hashing
The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”.
Because of the hashing, only the integer index of each feature is kept, not the original string name of the feature, so there is no inverse_transform method.
FeatureHasher accepts dicts, (feature, value) pairs, or strings, depending on the constructor parameter input_type. The result is a scipy.sparse matrix. With strings, each occurrence gets a value of 1, so ['feat1', 'feat2', 'feat2'] is interpreted as [('feat1', 1), ('feat2', 2)].
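A small sketch of the three input types, to check that they really are equivalent. The choices `n_features=16` and `alternate_sign=False` are my own assumptions to keep the output small and non-negative; by default the hasher uses far more columns and may flip signs.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Each call to transform takes an iterable of samples; here there is one sample.
hasher_str = FeatureHasher(n_features=16, input_type="string", alternate_sign=False)
X_str = hasher_str.transform([["feat1", "feat2", "feat2"]])

hasher_pair = FeatureHasher(n_features=16, input_type="pair", alternate_sign=False)
X_pair = hasher_pair.transform([[("feat1", 1), ("feat2", 2)]])

hasher_dict = FeatureHasher(n_features=16, input_type="dict", alternate_sign=False)
X_dict = hasher_dict.transform([{"feat1": 1, "feat2": 2}])

# The same feature names hash to the same columns, so all three rows match.
same = (np.array_equal(X_str.toarray(), X_pair.toarray())
        and np.array_equal(X_pair.toarray(), X_dict.toarray()))
print(X_str.shape, same)
```

Note that no fitting is needed: the column for a feature is computed from its name alone, which is also why the names cannot be recovered afterwards.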
4、text feature extraction
There is too much material for this part, so I wrote it up separately; see this blog post: http://blog.csdn/mmc2015/article/details/46997379
5、image feature extraction
Patch extraction:
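The extract_patches_2d function extracts small patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis. reconstruct_from_patches_2d can rebuild the original image from all of its patches.
Reconstruction works by filling the patches back in from left to right, top to bottom, and averaging the regions where patches overlap.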
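A minimal sketch of the round trip, using a tiny synthetic 4x4 "image" with 3 color channels (the array contents are arbitrary and only there for illustration):

```python
import numpy as np
from sklearn.feature_extraction import image

# Synthetic 4x4 RGB image; real images would come from a file instead.
one_image = np.arange(4 * 4 * 3, dtype=float).reshape((4, 4, 3))

# All overlapping 2x2 patches: 3 positions per axis, so 9 patches.
patches = image.extract_patches_2d(one_image, (2, 2))

# Rebuild the image; overlapping regions are averaged.
reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))

print(patches.shape)
print(np.allclose(one_image, reconstructed))
```

Because the patches here are unmodified, averaging the overlaps gives back the original pixels; if the patches had been processed (denoised, for example), the average would blend the results.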
The PatchExtractor class works the same way as extract_patches_2d, except that it accepts several images as input at once.
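A short sketch with five synthetic images stacked along the first axis (again, the data is arbitrary and only illustrative):

```python
import numpy as np
from sklearn.feature_extraction import image

# Five synthetic 4x4 RGB images in one array: (n_images, height, width, channels).
five_images = np.arange(5 * 4 * 4 * 3, dtype=float).reshape(5, 4, 4, 3)

extractor = image.PatchExtractor(patch_size=(2, 2))
all_patches = extractor.transform(five_images)

# 9 patches per image, concatenated over the 5 images.
print(all_patches.shape)
```

Since it is an estimator-style class, PatchExtractor can also be dropped into a Pipeline, which is the main reason to prefer it over calling extract_patches_2d in a loop.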
Connectivity graph of an image:
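The idea is to record which pixels of an image are connected to each other, that is, which pixels are neighbors, with the difference between neighboring pixels used as the edge weight.
The function img_to_graph returns such a matrix from a 2D or 3D image. Similarly, grid_to_graph builds a connectivity matrix for images given the shape of these images.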
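A small sketch of both functions on a 3x3 grid. The toy image is my own; pixels are numbered row by row, so pixel 0 is the top-left corner and pixel 8 the bottom-right.

```python
import numpy as np
from sklearn.feature_extraction.image import grid_to_graph, img_to_graph

# Toy 3x3 grayscale image (values are arbitrary).
img = np.arange(9, dtype=float).reshape(3, 3)

# Weighted graph built from the pixel values themselves.
graph = img_to_graph(img)

# Pure connectivity built from the shape only, no pixel values needed.
connectivity = grid_to_graph(n_x=3, n_y=3)
conn = connectivity.toarray()

print(graph.shape, connectivity.shape)
print(conn[0, 1], conn[0, 3], conn[0, 8])  # neighbor, neighbor, far corner
```

A matrix like `connectivity` is what clustering estimators such as AgglomerativeClustering take through their `connectivity` parameter, so that only neighboring pixels can be merged.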
Here is an intuitive example: http://scikit-learn/stable/auto_examples/cluster/plot_lena_ward_segmentation.html#example-cluster-plot-lena-ward-segmentation-py
My head hurts... off to sleep...