I have a data set which is actually an occurrence matrix of a feature vector for some number of items. In theory, this type of representation helps to apply machine learning algorithms to the data set, as it is already normalized.
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,class
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,class1
0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,0,1,0,0,1,class2
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,class2
1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,1,class3

But I can't seem to use the algorithms provided by pandas and scikit-learn in Python. I haven't seen any examples.
The format of the data set is as above, where the feature vector is [a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z] and the class variable is the last column of each row (e.g. 'class1', 'class2', 'class3').
How could I apply algorithms such as decision trees (CART) and Naive Bayes to this type of data set? (I have only checked the scikit-learn library.)
Accepted answer
You need to use integers for your class/dependent variable, not strings.
Here's an example:
In [1]: # Here I'm just mapping very simply; you could use a regex or similar
        # for your case if you have a lot of classes
        df['class'] = df['class'].map({'class1': 0, 'class2': 1, 'class3': 2})

In [2]: df
Out[2]:
   a  b  ...  y  z  class
0  1  1  ...  1  1      0
1  0  1  ...  0  1      1
2  0  0  ...  0  1      1
3  1  0  ...  0  1      2

In [3]: # Split into X (independent variables) and y (dependent, class)
        X = df.iloc[:, :-1]
        y = df['class']

In [4]: # Now you can do your fit etc.
        from sklearn.naive_bayes import GaussianNB
        gnb = GaussianNB()
        result = gnb.fit(X, y)

In [5]: y_pred = result.predict(X)
        y_pred
Out[5]: array([0, 1, 1, 2], dtype=int64)

We see that it correctly predicted the classes on the training data (unsurprising, given that the number of features far exceeds the sample size, p > n).
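To also cover the CART part of the question, here is a minimal self-contained sketch using the sample rows from the question (inlined via `io.StringIO` rather than read from a file, which is an assumption for the example) with scikit-learn's `DecisionTreeClassifier`, which implements a CART-style decision tree:

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The sample rows from the question, inlined so the example is self-contained;
# in practice you would use pd.read_csv("your_file.csv") instead.
csv_data = """a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,class
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,class1
0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,0,1,0,0,1,class2
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,class2
1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,1,class3"""

df = pd.read_csv(io.StringIO(csv_data))

# Map the string labels to integers as in the answer above
# (recent scikit-learn versions also accept string labels directly).
df['class'] = df['class'].map({'class1': 0, 'class2': 1, 'class3': 2})

X = df.iloc[:, :-1]   # the 26 binary features a..z
y = df['class']       # integer class labels

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
print(clf.predict(X))  # predictions on the training data: [0 1 1 2]
```

With only four samples the tree memorizes the training data perfectly, so this only demonstrates the mechanics; real evaluation would use a held-out test set (e.g. `sklearn.model_selection.train_test_split`).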