Machine Learning in Action: Predicting Horse Mortality from Colic with Logistic Regression

Updated: 2024-10-22 18:41:23


1. Data processing
First, we load the data and take a look at it:

with open('F:/MachineLearning/data/horse-colic.txt') as file:
    print(file.read())

The data looks like this:

1 1 530170 38.10 88 24 3 3 4 1 5 4 3 2 1 ? 3 4 41.00 4.60 ? ? 2 1 02209 00000 00000 2 
1 1 527709 38.00 108 60 2 3 4 1 4 3 3 2 ? ? 3 4 ? ? 3 ? 1 1 02205 00000 00000 2 
2 1 528169 38.20 48 ? 2 ? 1 2 3 3 1 2 1 ? ? 2 34.00 6.60 ? ? 1 2 03111 00000 00000 2 
......
1 1 529386 37.50 72 30 4 3 4 1 4 4 3 2 1 ? 3 5 60.00 6.80 ? ? 2 1 03205 00000 00000 2
1 1 530612 36.50 100 24 3 3 3 1 3 3 3 3 1 ? 4 4 50.00 6.00 3 3.40 1 1 02208 00000 00000 1
1 1 534618 37.2 40 20 ? ? ? ? ? ? ? ? ? ? 4 1 36 62 1 1 3 2 06112 00000 00000 2 

Next, let's look at what these features mean:

with open('F:/MachineLearning/data/horse-colic_names.txt') as names_data:
    names = names_data.read()
print(names)

Output:

1. TItle: Horse Colic database

2. Source Information
   -- Creators: Mary McLeish & Matt Cecile
      Department of Computer Science
      University of Guelph
      Guelph, Ontario, Canada N1G 2W1
      mdmcleish@water.waterloo.edu
   -- Donor:    Will Taylor (taylor@pluto.arc.nasa.gov)
   -- Date:     8/6/89

3. Past Usage:
   -- Unknown

4. Relevant Information:
   -- 2 data files
      -- horse-colic.data: 300 training instances
      -- horse-colic.test: 68 test instances
   -- Possible class attributes: 24 (whether lesion is surgical)
      -- others include: 23, 25, 26, and 27
   -- Many Data types: (continuous, discrete, and nominal)

5. Number of Instances: 368 (300 for training, 68 for testing)

6. Number of attributes: 28

7. Attribute Information:

   1:  surgery?
       1 = Yes, it had surgery
       2 = It was treated without surgery

   2:  Age
       1 = Adult horse
       2 = Young (< 6 months)

   3:  Hospital Number
       - numeric id
       - the case number assigned to the horse
         (may not be unique if the horse is treated > 1 time)

   4:  rectal temperature
       - linear
       - in degrees celsius.
       - An elevated temp may occur due to infection.
       - temperature may be reduced when the animal is in late shock
       - normal temp is 37.8
       - this parameter will usually change as the problem progresses
         eg. may start out normal, then become elevated because of
         the lesion, passing back through the normal range as the
         horse goes into shock

   5:  pulse
       - linear
       - the heart rate in beats per minute
       - is a reflection of the heart condition: 30 -40 is normal for adults
       - rare to have a lower than normal rate although athletic horses
         may have a rate of 20-25
       - animals with painful lesions or suffering from circulatory shock
         may have an elevated heart rate

   6:  respiratory rate
       - linear
       - normal rate is 8 to 10
       - usefulness is doubtful due to the great fluctuations

   7:  temperature of extremities
       - a subjective indication of peripheral circulation
       - possible values:
         1 = Normal
         2 = Warm
         3 = Cool
         4 = Cold
       - cool to cold extremities indicate possible shock
       - hot extremities should correlate with an elevated rectal temp.

   8:  peripheral pulse
       - subjective
       - possible values are:
         1 = normal
         2 = increased
         3 = reduced
         4 = absent
       - normal or increased p.p. are indicative of adequate circulation
         while reduced or absent indicate poor perfusion

   9:  mucous membranes
       - a subjective measurement of colour
       - possible values are:
         1 = normal pink
         2 = bright pink
         3 = pale pink
         4 = pale cyanotic
         5 = bright red / injected
         6 = dark cyanotic
       - 1 and 2 probably indicate a normal or slightly increased
         circulation
       - 3 may occur in early shock
       - 4 and 6 are indicative of serious circulatory compromise
       - 5 is more indicative of a septicemia

   10: capillary refill time
       - a clinical judgement. The longer the refill, the poorer the
         circulation
       - possible values
         1 = < 3 seconds
         2 = >= 3 seconds

   11: pain - a subjective judgement of the horse's pain level
       - possible values:
         1 = alert, no pain
         2 = depressed
         3 = intermittent mild pain
         4 = intermittent severe pain
         5 = continuous severe pain
       - should NOT be treated as a ordered or discrete variable!
       - In general, the more painful, the more likely it is to require
         surgery
       - prior treatment of pain may mask the pain level to some extent

   12: peristalsis
       - an indication of the activity in the horse's gut. As the gut
         becomes more distended or the horse becomes more toxic, the
         activity decreases
       - possible values:
         1 = hypermotile
         2 = normal
         3 = hypomotile
         4 = absent

   13: abdominal distension
       - An IMPORTANT parameter.
       - possible values
         1 = none
         2 = slight
         3 = moderate
         4 = severe
       - an animal with abdominal distension is likely to be painful and
         have reduced gut motility.
       - a horse with severe abdominal distension is likely to require
         surgery just tio relieve the pressure

   14: nasogastric tube
       - this refers to any gas coming out of the tube
       - possible values:
         1 = none
         2 = slight
         3 = significant
       - a large gas cap in the stomach is likely to give the horse
         discomfort

   15: nasogastric reflux
       - possible values
         1 = none
         2 = > 1 liter
         3 = < 1 liter
       - the greater amount of reflux, the more likelihood that there is
         some serious obstruction to the fluid passage from the rest of
         the intestine

   16: nasogastric reflux PH
       - linear
       - scale is from 0 to 14 with 7 being neutral
       - normal values are in the 3 to 4 range

   17: rectal examination - feces
       - possible values
         1 = normal
         2 = increased
         3 = decreased
         4 = absent
       - absent feces probably indicates an obstruction

   18: abdomen
       - possible values
         1 = normal
         2 = other
         3 = firm feces in the large intestine
         4 = distended small intestine
         5 = distended large intestine
       - 3 is probably an obstruction caused by a mechanical impaction
         and is normally treated medically
       - 4 and 5 indicate a surgical lesion

   19: packed cell volume
       - linear
       - the # of red cells by volume in the blood
       - normal range is 30 to 50. The level rises as the circulation
         becomes compromised or as the animal becomes dehydrated.

   20: total protein
       - linear
       - normal values lie in the 6-7.5 (gms/dL) range
       - the higher the value the greater the dehydration

   21: abdominocentesis appearance
       - a needle is put in the horse's abdomen and fluid is obtained from
         the abdominal cavity
       - possible values:
         1 = clear
         2 = cloudy
         3 = serosanguinous
       - normal fluid is clear while cloudy or serosanguinous indicates
         a compromised gut

   22: abdomcentesis total protein
       - linear
       - the higher the level of protein the more likely it is to have a
         compromised gut. Values are in gms/dL

   23: outcome
       - what eventually happened to the horse?
       - possible values:
         1 = lived
         2 = died
         3 = was euthanized

   24: surgical lesion?
       - retrospectively, was the problem (lesion) surgical?
       - all cases are either operated upon or autopsied so that
         this value and the lesion type are always known
       - possible values:
         1 = Yes
         2 = No

   25, 26, 27: type of lesion
       - first number is site of lesion
         1 = gastric
         2 = sm intestine
         3 = lg colon
         4 = lg colon and cecum
         5 = cecum
         6 = transverse colon
         7 = retum/descending colon
         8 = uterus
         9 = bladder
         11 = all intestinal sites
         00 = none
       - second number is type
         1 = simple
         2 = strangulation
         3 = inflammation
         4 = other
       - third number is subtype
         1 = mechanical
         2 = paralytic
         0 = n/a
       - fourth number is specific code
         1 = obturation
         2 = intrinsic
         3 = extrinsic
         4 = adynamic
         5 = volvulus/torsion
         6 = intussuption
         7 = thromboembolic
         8 = hernia
         9 = lipoma/slenic incarceration
         10 = displacement
         0 = n/a

   28: cp_data
       - is pathology data present for this case?
         1 = Yes
         2 = No
       - this variable is of no significance since pathology data
         is not included or collected for these cases

8. Missing values: 30% of the values are missing

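We have 28 features here. After reviewing what they mean, we can drop some of them and transform others:
① Feature 3 is the hospital number. It carries no statistical meaning, so we delete it.
② Feature 23 records whether the horse lived, died, or was euthanized. This is the outcome we want to predict, i.e. the label of our data. We group "euthanized" together with "died" as death (the positive class, encoded as 1) and treat "lived" as the negative class (encoded as 0), and we separate this column from the training features.
③ Feature 25 is a five-digit code whose digits encode the lesion site, type, subtype, and specific code, so we split it into five single-digit features.
④ Features 26 and 27 appear to take the same value (00000) throughout the data, so they tell us nothing about survival, at least on this training set. We delete both.
⑤ Feature 28 indicates whether pathology data is present for the case. The attribute description itself states that this variable is of no significance, so we delete it as well.
⑥ A question mark (?) marks a missing value. We could fill it with the mean of the feature or with 0; for convenience we use 0. The rationale is that each sample contributes (h - y) * x_j to the gradient component for weight j, so a feature value of 0 contributes nothing to that weight's update.
⑦ Finally, don't forget to add the constant (intercept) column to the training set.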
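The reasoning in item ⑥ can be checked numerically. The following is a minimal sketch, not part of the original pipeline: the weights, the sample, and the helper name `per_sample_gradient` are made up for illustration. It shows that a feature stored as 0 adds nothing to the gradient component of its own weight.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_gradient(theta, x, y):
    """Gradient contribution of a single sample: (h - y) * x."""
    return (sigmoid(x @ theta) - y) * x

theta = np.array([0.1, -0.2, 0.3])   # illustrative weights (assumption)
x = np.array([1.0, 0.0, 2.5])        # middle feature was '?', stored as 0
contribution = per_sample_gradient(theta, x, y=1)

# The middle entry is zero, the other two are not.
print(contribution)
```

Note that this only says the missing entry does not push its own weight in either direction; whether 0 is a sensible stand-in for a given feature is a separate question.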
In addition, the note at the end of the description tells us that 30% of the values in this dataset are missing.
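The 30% figure comes from the dataset description. To measure the share of missing fields in a particular file, one could count the `?` markers directly. This is a rough sketch; `missing_fraction` is a hypothetical helper, and the two sample rows are copied from the listing above.

```python
def missing_fraction(lines):
    """Return the share of whitespace-separated fields equal to '?'."""
    total = 0
    missing = 0
    for line in lines:
        fields = line.split()
        total += len(fields)
        missing += fields.count('?')
    return missing / total if total else 0.0

sample = [
    "1 1 530170 38.10 88 24 3 3 4 1 5 4 3 2 1 ? 3 4 41.00 4.60 ? ? 2 1 02209 00000 00000 2",
    "1 1 527709 38.00 108 60 2 3 4 1 4 3 3 2 ? ? 3 4 ? ? 3 ? 1 1 02205 00000 00000 2",
]
# These two rows hold 8 question marks among 56 fields.
print(missing_fraction(sample))
```

Passing an open file object instead of `sample` would give the figure for the whole file, since iterating over a file yields its lines.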
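Below is the data-processing function. It returns the feature matrix with the constant column already added, together with the labels: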

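import numpy as np

def prepared_data(path):
    # Read all records; the with-block closes the file afterwards.
    with open(path) as file:
        fr = file.readlines()
    data_list = []
    y = []
    for data in fr:
        data = data.strip()
        data = data.split(' ')
        data_25 = data[24]      # feature 25: five-digit lesion code
        y.append(data[22])      # feature 23: outcome (the label)
        del data[-1]            # feature 28: cp_data
        del data[26]            # feature 27
        del data[25]            # feature 26
        del data[24]            # feature 25 (re-added below, digit by digit)
        del data[22]            # feature 23
        del data[2]             # feature 3: hospital number
        for i in data_25:
            data.append(i)
        # Missing values ('?') become 0; everything else becomes a float.
        data = [0 if data_elem == '?' else float(data_elem) for data_elem in data]
        data_list.append(data)
    # Died (2) or euthanized (3) -> 1, otherwise 0.
    y = np.array([1 if label == '2' or label == '3' else 0 for label in y])
    X = np.array(data_list)
    X = np.insert(X, 0, 1, axis=1)   # prepend the constant column
    return X, y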

We use this function to process the training set and the test set:

X,y = prepared_data('F:/MachineLearning/data/horse-colic.txt')
X_test,y_test=prepared_data('F:/MachineLearning/data/horse-colic-test.txt')
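To see what the transformation does to a single record, the steps ① to ⑥ can be replayed on the first sample row shown earlier. This is only an illustrative sketch: `parse_row` is a hypothetical helper that mirrors the logic of `prepared_data` for one line and leaves out the constant column.

```python
def parse_row(line):
    """Turn one raw record into (features, label), following steps 1 to 6."""
    fields = line.split()
    label = 1 if fields[22] in ('2', '3') else 0   # died or euthanized -> 1
    lesion_digits = list(fields[24])               # feature 25, digit by digit
    # 0-based positions of features 3, 23, 25, 26, 27 and 28.
    dropped = {2, 22, 24, 25, 26, 27}
    kept = [f for i, f in enumerate(fields) if i not in dropped]
    kept += lesion_digits
    features = [0.0 if f == '?' else float(f) for f in kept]
    return features, label

row = "1 1 530170 38.10 88 24 3 3 4 1 5 4 3 2 1 ? 3 4 41.00 4.60 ? ? 2 1 02209 00000 00000 2"
features, label = parse_row(row)
print(len(features), label)   # 27 1
print(features[-5:])          # [0.0, 2.0, 2.0, 0.0, 9.0]
```

With the constant column added, each row of `X` therefore has 28 entries: 22 retained features, 5 lesion digits, and the leading 1.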

2. The sigmoid function, cost function, and gradient function

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def regularized_cost(param, X, y, regularized_param):
    m, d = X.shape
    h = sigmoid(X @ param)
    total_cost1 = - y.T @ np.log(h)
    total_cost2 = -(1 - y).T @ np.log(1 - h)
    cost = (total_cost1 + total_cost2) / m
    # Leave the intercept param[0] out of the penalty so that the cost
    # matches regularized_gradient, which zeroes the intercept term.
    regularized_term = (param[1:].T @ param[1:]) * (regularized_param / (2 * m))
    return cost + regularized_term

def cost(param, X, y):
    m, d = X.shape
    h = sigmoid(X @ param)
    total_cost1 = - y.T @ np.log(h)
    total_cost2 = -(1 - y).T @ np.log(1 - h)
    return (total_cost1 + total_cost2) / m

def regularized_gradient(param, X, y, regularized_param):
    m, d = X.shape
    h = sigmoid(X @ param)
    gradient = (X.T @ (h - y)) / m
    regularized_term = (regularized_param / m) * param
    regularized_term[0] = 0   # the intercept is not regularized
    return regularized_term + gradient
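For reference, these functions correspond to the usual formulas for regularized logistic regression. The notation is an assumption made here, not the article's: m is the number of samples, θ stands for `param`, λ for `regularized_param`, and the intercept θ_0 is left unpenalized, as `regularized_gradient` does.

```latex
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)})
          + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]
          + \frac{\lambda}{2m} \sum_{j \ge 1} \theta_j^{2}

\frac{\partial J}{\partial \theta_0}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)},
\qquad
\frac{\partial J}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
  + \frac{\lambda}{m} \theta_j \quad (j \ge 1)
```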

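Both the cost function and the gradient function above include regularization. We also write a cost function without the regularization term, so that the model can be evaluated on the test set when we choose the regularization coefficient.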
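Since the optimizer is handed both a cost and a gradient, it is worth confirming that the two agree. One common way is a finite-difference check. The sketch below is self-contained and uses random stand-in data; `reg_cost`, `reg_gradient`, and `numerical_gradient` are illustrative names, written under the assumption that the intercept is excluded from the penalty in both the cost and the gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_cost(theta, X, y, lam):
    m = X.shape[0]
    h = sigmoid(X @ theta)
    data_term = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    return data_term + lam / (2 * m) * (theta[1:] @ theta[1:])

def reg_gradient(theta, X, y, lam):
    m = X.shape[0]
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += lam / m * theta[1:]   # intercept left unpenalized
    return grad

def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference approximation of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Random stand-in data (assumption): 20 samples, intercept plus 3 features.
rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
y_demo = rng.integers(0, 2, size=20)
theta_demo = rng.normal(scale=0.1, size=4)
lam = 1.0

analytic = reg_gradient(theta_demo, X_demo, y_demo, lam)
numeric = numerical_gradient(lambda t: reg_cost(t, X_demo, y_demo, lam), theta_demo)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)   # should be tiny if cost and gradient are consistent
```

If the penalty in the cost included the intercept while the gradient did not, the first component would show a clear mismatch whenever the intercept and λ are both nonzero.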

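3. Choosing the regularization coefficient
Here we choose the regularization coefficient on the test set. This is not a rigorous approach: the sound way is to choose the coefficient on a cross-validation set and then report the model's accuracy on the test set. Because our data is limited, we take the less rigorous route. The function below fits the model for each candidate coefficient and prints the resulting test error rate: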

from scipy.optimize import minimize

def select_regularized_param(regularized_param_list):
    # Uses the module-level X, y, X_test and y_test prepared above.
    param = np.zeros(X.shape[1])
    for regularized_param in regularized_param_list:
        fmin = minimize(fun=regularized_cost, x0=param,
                        args=(X, y, regularized_param),
                        method='TNC', jac=regularized_gradient)
        theta = fmin.x
        h_test = sigmoid(X_test @ theta)
        # Predict death (1) when the estimated probability is at least 0.5.
        h_test = [1 if p >= 0.5 else 0 for p in h_test]
        m_test = y_test.shape[0]
        error_number = 0
        for i in range(m_test):
            if y_test[i] != h_test[i]:
                error_number += 1
        error_rate = error_number / m_test
        print(error_rate, regularized_param)

We set up a list of candidate regularization coefficients and pick the best one from it:

L=[0,0.001,0.003,0.01,0.03,0.1,0.3,1,3,10,30,100]
select_regularized_param(L)

Output:

0.27941176470588236 0
0.27941176470588236 0.001
0.27941176470588236 0.003
0.27941176470588236 0.01
0.27941176470588236 0.03
0.27941176470588236 0.1
0.27941176470588236 0.3
0.2647058823529412 1
0.2647058823529412 3
0.2647058823529412 10
0.2647058823529412 30
0.29411764705882354 100
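A quick arithmetic check helps put these numbers in perspective. Assuming the test set holds the 68 records mentioned in the attribute description, the three distinct error rates translate into whole numbers of misclassified horses:

```python
m_test = 68   # number of test records, per the dataset description
error_rates = [0.27941176470588236, 0.2647058823529412, 0.29411764705882354]

# Convert each rate back into a count of misclassified test samples.
error_counts = [round(rate * m_test) for rate in error_rates]
print(error_counts)   # [19, 18, 20]
print(1 / m_test)     # smallest possible change in the error rate
```

The best and worst coefficients in the list thus differ by only two test samples, so the ranking among them should be read with caution.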

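We pick from the coefficients that give the lowest error rate, i.e. any one of 1, 3, 10, or 30, for a minimum error rate of 0.2647058823529412. This error rate is somewhat high, but considering that 30% of the values are missing, the result is reasonable.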
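As noted in section 3, the more careful route is to choose the coefficient by cross-validation on the training data and keep the test set for the final estimate only. The following is a minimal sketch of k-fold selection, not the author's method: it runs on synthetic stand-in data, and `reg_cost`, `reg_gradient`, and `cv_error` are illustrative names. The cost clips probabilities to avoid log(0), which is an addition made for this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_cost(theta, X, y, lam):
    m = X.shape[0]
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)   # avoid log(0)
    data_term = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    return data_term + lam / (2 * m) * (theta[1:] @ theta[1:])

def reg_gradient(theta, X, y, lam):
    m = X.shape[0]
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += lam / m * theta[1:]   # intercept left unpenalized
    return grad

def cv_error(X, y, lam, k=5, seed=0):
    """Mean validation error rate for coefficient lam over k folds."""
    order = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(order, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        fit = minimize(reg_cost, np.zeros(X.shape[1]),
                       args=(X[train], y[train], lam),
                       method='TNC', jac=reg_gradient)
        pred = (sigmoid(X[val] @ fit.x) >= 0.5).astype(int)
        errors.append(np.mean(pred != y[val]))
    return float(np.mean(errors))

# Synthetic stand-in data (assumption): replace with the training X, y.
rng = np.random.default_rng(1)
X_demo = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
true_theta = np.array([0.0, 2.0, -1.0, 0.5])
y_demo = (X_demo @ true_theta + rng.normal(size=100) > 0).astype(int)

cv_results = {lam: cv_error(X_demo, y_demo, lam) for lam in [0, 0.1, 1, 10]}
best_lam = min(cv_results, key=cv_results.get)
print(cv_results, best_lam)
```

With the real data, the coefficient chosen this way would be refit on the full training set and evaluated once on the test set.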

Published: 2024-02-12 22:30:25
Article link: https://www.elefans.com/category/jswz/34/1689663.html