Machine Learning (3): Building Decision Trees

Updated: 2024-10-24 19:19:15

1. Decision trees explained

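A decision tree is a tree structure that describes how instances are classified. It consists of nodes and directed edges. There are two kinds of nodes: internal nodes, each of which represents a feature (attribute), and leaf nodes, each of which represents a class. Decision tree classification is an instance-based inductive learning method: from an unordered set of training samples it derives a tree-shaped classification model. Each non-leaf node records which feature is tested at that point, and each leaf node gives the class that is finally assigned.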

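Decision trees have several advantages: low computational cost, output that is easy to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant features. Their main drawback is a tendency to match the training data too closely, that is, to overfit.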

2. Building a simple decision tree

2.1 Building a decision tree with information gain

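We start by building a simple decision tree from the code examples in the book, after a brief look at information gain. The guiding principle for splitting a dataset is to make disordered data more ordered. The change in information before and after a split is called the information gain. We can compute the gain obtained by splitting on each feature, and the feature with the highest gain is the best choice. For a dataset D with K classes, where p_k is the proportion of samples that belong to class k, the information entropy is

$$\mathrm{Ent}(D) = -\sum_{k=1}^{K} p_k \log_2 p_k$$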

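The information gain of splitting D on an attribute a is computed as follows, where Ent denotes the information entropy, a takes V distinct values, and D^v is the subset of D on which a takes its v-th value:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$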

The following function computes the Shannon entropy of a given dataset:

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    # Count how many times each class label (last column) occurs
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    # Sum -p * log2(p) over all class labels
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

Now test it with the simple example from the book:

def createDataSet():
    # Each row is [no surfacing, flippers, class label]
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

myDat, labels = createDataSet()
print(myDat)
print(calcShannonEnt(myDat))

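Result: the dataset is printed, followed by its entropy, which is about 0.971 for two 'yes' and three 'no' labels.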

The higher the entropy, the more mixed the data. We can add more classes to the dataset and observe how the entropy changes:

myDat[0][-1] = 'maybe'
print(myDat)
print(calcShannonEnt(myDat))
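As a rough illustration of how entropy grows with the number of classes, the sketch below computes the entropy of a few hypothetical label lists. The helper name `label_entropy` and the label lists are assumptions made for this example and are not part of the book's code.

```python
from collections import Counter
from math import log2

def label_entropy(class_labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(class_labels)
    counts = Counter(class_labels)
    # p * log2(1/p) is the same as -p * log2(p)
    return sum((n / total) * log2(total / n) for n in counts.values())

# Hypothetical label lists with 1, 2 and 4 equally frequent classes
for class_labels in (['a', 'a'], ['a', 'b'], ['a', 'b', 'c', 'd']):
    print(len(set(class_labels)), label_entropy(class_labels))
```

With k equally frequent classes the entropy is log2(k), so every doubling of the number of classes adds one bit.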

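Besides measuring entropy, a classification algorithm also needs to split the dataset and measure the entropy of the resulting subsets, in order to judge whether the split is a good one. We compute the entropy once for the split produced by each feature, and then decide which feature gives the best split.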

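The code for splitting the dataset is shown below. Its three parameters are the dataset to split, the index of the feature to split on, and the feature value whose rows should be returned.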

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # Keep the row, but drop the column that was split on
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

Test the output:

print(splitDataSet(myDat, 0, 1))
print(splitDataSet(myDat, 0, 0))

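Next we select the feature that gives the best split. The data passed to this function must meet two requirements. First, it must be a list of lists, and all inner lists must have the same length. Second, the last column (the last element of each instance) must be the class label of that instance. The implementation is as follows: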

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        # Weighted entropy of the subsets produced by splitting on feature i
        newEntropy = 0.0
        for value in uniqueVals:
            subDataset = splitDataSet(dataSet, i, value)
            prob = len(subDataset) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataset)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

Test:

print(chooseBestFeatureToSplit(myDat))
print(myDat)
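To see the numbers behind this choice, the sketch below recomputes the information gain of each feature on the book's original five-sample dataset, before the first label was changed to 'maybe'. The helper names `entropy` and `info_gain` are assumptions for this example; they mirror the logic of calcShannonEnt and chooseBestFeatureToSplit rather than reuse them.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy of the class labels stored in the last column."""
    counts = Counter(row[-1] for row in rows)
    return sum((n / len(rows)) * log2(len(rows) / n) for n in counts.values())

def info_gain(rows, axis):
    """Entropy of rows minus the weighted entropy after splitting on column axis."""
    remainder = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

# The book's dataset with its original labels
data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
for axis, name in enumerate(['no surfacing', 'flippers']):
    print(name, round(info_gain(data, axis), 3))
```

The gain is about 0.420 for 'no surfacing' and about 0.171 for 'flippers', so feature 0 is the better first split on this data.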

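With these steps in place we can build the decision tree. First, the function below creates a dictionary whose keys are the unique values in classList. The dictionary stores how often each class label occurs in classList; the entries are then sorted, and the most frequent class label is returned.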

import operator

def majoritityCnt(classList):
    # Count the occurrences of each class label
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort by count, largest first, and return the top label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
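The same majority vote can be written more compactly with the standard library. This is an alternative sketch, not the book's code, and the name `majority_label` is an assumption for this example.

```python
from collections import Counter

def majority_label(class_list):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(class_list).most_common(1)[0][0]

print(majority_label(['yes', 'no', 'no']))
```

`Counter.most_common` orders labels with equal counts by first occurrence, which matches the behavior of the stable sort used above.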

We can then create the decision tree from the dataset and the feature labels:

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # Stop when every sample has the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Stop when no features are left; fall back to a majority vote
    if len(dataSet[0]) == 1:
        return majoritityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])      # note: this modifies the list passed in by the caller
    featValues = [example[bestFeat] for example in dataSet]
    uniqueValues = set(featValues)
    for value in uniqueValues:
        subLabels = labels[:]  # copy so that recursive calls do not interfere with each other
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

Test:

myTree = createTree(myDat, labels)
print(myTree)
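A tree stored as nested dictionaries can be used for prediction by following one branch per feature until a leaf is reached. The sketch below is one possible way to do this; the function name `classify` and the hard-coded `example_tree` (the tree the book's dataset yields with its original labels) are assumptions for this example. Because createTree deletes entries from the labels list it receives, the feature names are given here as a fresh list.

```python
def classify(tree, feat_labels, test_vec):
    """Walk a nested-dict tree until a leaf (a non-dict value) is reached."""
    feature_name = next(iter(tree))             # feature tested at this node
    branches = tree[feature_name]
    feat_index = feat_labels.index(feature_name)
    # Raises KeyError if the feature value never occurred in the training data
    subtree = branches[test_vec[feat_index]]
    if isinstance(subtree, dict):
        return classify(subtree, feat_labels, test_vec)
    return subtree

# Assumed tree for the book's dataset with its original 'yes'/'no' labels
example_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
feat_labels = ['no surfacing', 'flippers']
print(classify(example_tree, feat_labels, [1, 0]))
print(classify(example_tree, feat_labels, [1, 1]))
```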

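Next we can draw the tree with Matplotlib. First we define the text box and arrow styles and a function that draws an annotation with an arrow. We also create the createPlot() function, which creates a new figure, clears the drawing area, and then draws two nodes that represent the two node types. The code is as follows: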

import matplotlib.pyplot as plt

# Box styles for the two node types and the arrow style
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def plotNode(nodeTxt, centerpt, parentpt, nodeType):
    # Draw a boxed label at centerpt with an arrow coming from parentpt
    createPlot.axl.annotate(nodeTxt, xy=parentpt, xycoords='axes fraction',
                            xytext=centerpt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType,
                            arrowprops=arrow_args)

def createPlot():
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    createPlot.axl = plt.subplot(111, frameon=False)
    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()

Test result:

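Next we draw the complete annotated tree. We need to know how many leaf nodes there are in order to size the x axis correctly, and how many levels there are in order to size the y axis. The following two functions return the number of leaf nodes and the depth of the tree: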

def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        # A dict value is a subtree; anything else is a leaf
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

Test:

print(getNumLeafs(myTree))
print(getTreeDepth(myTree))

Finally we plot the complete decision tree with the code below. createPlot() is the main function; it calls plotTree(), which in turn calls plotMidText(). plotTree() works recursively.

def plotMidText(cntrPt, parentPt, txtString):
    # Write the branch value midway between the parent and child nodes
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.axl.text(xMid, yMid, txtString)

def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]
    # Center this decision node above the leaves it covers
    cntrpt = (plotTree.xoff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yoff)
    plotMidText(cntrpt, parentPt, nodeTxt)
    plotNode(firstStr, cntrpt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    # Move one level down before drawing the children
    plotTree.yoff = plotTree.yoff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], cntrpt, str(key))
        else:
            plotTree.xoff = plotTree.xoff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xoff, plotTree.yoff), cntrpt, leafNode)
            plotMidText((plotTree.xoff, plotTree.yoff), cntrpt, str(key))
    # Move back up one level after the children are drawn
    plotTree.yoff = plotTree.yoff + 1.0 / plotTree.totalD

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.axl = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xoff = -0.5 / plotTree.totalW
    plotTree.yoff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

Output:

If we add some other data to the tree dictionary, we can see how the plotted tree changes:

myTree['no surfacing'][3] = 'maybe'
print(myTree)
createPlot(myTree)

2.2 Building a decision tree with the information gain ratio

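When a tree is built with information gain, the criterion is biased toward attributes that can take many values: the more subsets an attribute splits the data into, the larger its information gain tends to be. This motivates another criterion, the information gain ratio, which divides the information gain Gain(D, a) of attribute a on dataset D by a quantity IV(a):

$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$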

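where IV(a) is defined as follows, with V the number of distinct values of attribute a and D^v the subset of D on which a takes its v-th value:

$$\mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$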

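At the same time, this criterion is biased in the opposite direction, toward attributes that can take only a few values: when an attribute splits the data into few subsets, its gain ratio tends to be larger.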
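The effect of dividing by IV(a) can be seen on a small made-up example. In the sketch below, the four samples, the attribute names `sample_id` and `flag`, and the helper names are all assumptions chosen for illustration. Both attributes separate the classes perfectly, so they have the same information gain, but the attribute with more distinct values has a larger IV and therefore a smaller gain ratio.

```python
from collections import Counter
from math import log2

def entropy(class_labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(class_labels)
    return sum((n / total) * log2(total / n) for n in Counter(class_labels).values())

def gain_and_iv(values, class_labels):
    """Return (information gain, IV) for one attribute column."""
    total = len(class_labels)
    remainder = 0.0
    iv = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, class_labels) if val == v]
        weight = len(subset) / total
        remainder += weight * entropy(subset)
        iv += weight * log2(1 / weight)      # same as -weight * log2(weight)
    return entropy(class_labels) - remainder, iv

# Hypothetical data: four samples, two attributes
class_labels = ['yes', 'yes', 'no', 'no']
sample_id = [0, 1, 2, 3]   # a different value for every sample
flag = [1, 1, 0, 0]        # only two values

for name, column in (('sample_id', sample_id), ('flag', flag)):
    gain, iv = gain_and_iv(column, class_labels)
    print(name, 'gain =', gain, 'IV =', iv, 'ratio =', gain / iv)
```

Here both gains are 1.0, while the ratios are 0.5 for `sample_id` and 1.0 for `flag`, so the gain ratio prefers the attribute with fewer values.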

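The implementation is largely the same as for information gain. The only difference is in choosing the best feature to split on, where we also compute IV(a) and divide the information gain by it. Only the part that differs is shown here: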

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        splitinfo = 0.0    # IV(a); reset for each feature so it does not accumulate across features
        for value in uniqueVals:
            subDataset = splitDataSet(dataSet, i, value)
            prob = len(subDataset) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataset)
            splitinfo += -prob * log(prob, 2)
        infoGain = baseEntropy - newEntropy
        if splitinfo == 0:
            continue       # the feature has a single value, so the ratio is undefined
        infoGain = infoGain / splitinfo    # information gain ratio
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

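The rest of the code is identical to the information gain version. Because the experimental data is so simple, the final result is no different from the previous one. More data is added later, where the results do differ.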

2.3 Building a decision tree with the Gini index

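In a classification problem, suppose the dataset D has K classes and a sample belongs to class k with probability p_k. The Gini value of this probability distribution is defined as:

$$\mathrm{Gini}(D) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$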

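The smaller Gini(D) is, the purer the dataset D. Given a dataset D, the Gini index of an attribute a with V distinct values, where D^v is the subset of D on which a takes its v-th value, is defined as:

$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)$$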
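This section gives no code, so the following is only a minimal sketch of how the two definitions could be computed, assuming the same data layout as before (a list of rows with the class label in the last column). The function names `gini`, `gini_index` and `choose_best_feature_by_gini` are hypothetical. The attribute with the smallest Gini index is taken as the best split.

```python
from collections import Counter

def gini(rows):
    """Gini value of the class labels stored in the last column."""
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def gini_index(rows, axis):
    """Weighted Gini value of the subsets produced by splitting on column axis."""
    total = len(rows)
    index = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        index += len(subset) / total * gini(subset)
    return index

def choose_best_feature_by_gini(rows):
    """Return the column whose split has the smallest Gini index."""
    num_features = len(rows[0]) - 1          # the last column is the class label
    return min(range(num_features), key=lambda axis: gini_index(rows, axis))

# The book's dataset with its original labels
data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(gini(data))
print([gini_index(data, axis) for axis in range(2)])
print(choose_best_feature_by_gini(data))
```

On this dataset Gini(D) is 0.48, the Gini index is about 0.267 for feature 0 and 0.4 for feature 1, so feature 0 is chosen, the same feature that information gain selects.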

3. Decision trees on a richer dataset

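The data below has six attributes. yea (age): 0, 1, 2 mean elderly, middle-aged, and young. work (salary): 0, 1, 2, 3, 4, 5 mean below 1000, 1000 to 3000, 3000 to 5000, 5000 to 8000, 8000 to 10000, and above 10000. house: 0 and 1 mean owns a house and does not own a house. car: 0, 1, 2 mean no car, an ordinary car, and a good car. child: 0 and 1 mean has children and has no children. credit: 0, 1, 2 mean poor, average, and good credit. The final decision is whether to grant this person a loan.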

Tree built with information gain:

Tree built with the information gain ratio: