I have a data set which is actually an occurrence matrix of a feature vector for some number of items. In theory, this type of representation helps to apply machine learning algorithms to the data set, as it is already normalized.
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,class
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,class1
0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,0,1,0,0,1,class2
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,class2
1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,1,class3

But I can't seem to use the algorithms provided by pandas and scikit-learn in Python. I haven't seen any examples.
The format of the data set is as above, where the feature vector is [a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z] and the class variable is the last column of each row (e.g. 'class1', 'class2', 'class3').
How could I apply algorithms such as decision trees (CART) and Naive Bayes to this type of data set? (I have only checked the scikit-learn library.)
Accepted answer
You need to use integers for your class/dependent variable, not strings.
Here's an example:
In [1]: # Here I'm just mapping very simply; you could use a regex or similar
        # for your case if you have a lot of classes
        df['class'] = df['class'].map({'class1': 0, 'class2': 1, 'class3': 2})

In [2]: df
Out[2]:
   a  b  ...  y  z  class
0  1  1  ...  1  1      0
1  0  1  ...  0  1      1
2  0  0  ...  0  1      1
3  1  0  ...  0  1      2

In [3]: # Split into X (independent variables) and y (dependent, class)
        X = df.iloc[:, :-1]
        y = df['class']

In [4]: # Now you can do your fit etc.
        from sklearn.naive_bayes import GaussianNB
        gnb = GaussianNB()
        result = gnb.fit(X, y)

In [5]: y_pred = result.predict(X)
        y_pred
Out[5]: array([0, 1, 1, 2], dtype=int64)

We see that it correctly predicted the classes on the training data (unsurprising, given that the number of features far exceeds the sample size, p > n).
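To also cover the CART part of the question, here is a minimal self-contained sketch using the sample rows from the question (inlined via `io.StringIO` rather than read from a file, which is an assumption for the example) with scikit-learn's `DecisionTreeClassifier`, which implements a CART-style decision tree:

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The sample rows from the question, inlined so the example is self-contained;
# in practice you would use pd.read_csv("your_file.csv") instead.
csv_data = """a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,class
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,class1
0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,0,1,0,0,1,class2
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,class2
1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,1,class3"""

df = pd.read_csv(io.StringIO(csv_data))

# Map the string labels to integers as in the answer above
# (recent scikit-learn versions also accept string labels directly).
df['class'] = df['class'].map({'class1': 0, 'class2': 1, 'class3': 2})

X = df.iloc[:, :-1]   # the 26 binary features a..z
y = df['class']       # integer class labels

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
print(clf.predict(X))  # predictions on the training data: [0 1 1 2]
```

With only four samples the tree memorizes the training data perfectly, so this only demonstrates the mechanics; real evaluation would use a held-out test set (e.g. `sklearn.model_selection.train_test_split`).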