会计师事务所python数据分析"/>
四大会计师事务所python数据分析
密码:oknt
思路:将json文件转加载到python中,转化为两个DataFrame(食物、营养成分),以id列进行内连接,根据group和nutrient列分组后,得到每种营养成分在各种食物中的含量高低
step1
import json
data=json.load(open(r'C:\Users\Administrator\Downloads\pydata-book-2nd-edition\datasets\usda_food\database.json'))
#将json格式数据加载到python中,data中都是字典型对象,对应一种食品及其详细介绍
data[0]
#data中的第一个元素,为第一种食物的有关介绍
data[0].keys()
dict_keys(['id', 'description', 'tags', 'manufacturer', 'group', 'portions', 'nutrients'])
#data1中的第一个元素的几个方面
data[0]['nutrients']
#第一种食物的所有营养成分
step2
food=pd.DataFrame(data,columns=['id','description','manufacturer','group'])
food.head()
id description manufacturer group
0 1008 Cheese, caraway Dairy and Egg Products
1 1009 Cheese, cheddar Dairy and Egg Products
2 1018 Cheese, edam Dairy and Egg Products
3 1019 Cheese, feta Dairy and Egg Products
4 1028 Cheese, mozzarella, part skim milk Dairy and Egg Products
#每个元素排列为DataFrame中的每一行,元素中的键作为DataFrame的列
food['group'].value_counts()[:10]
Vegetables and Vegetable Products 812
Beef Products 618
Baked Products 496
Breakfast Cereals 403
Fast Foods 365
Legumes and Legume Products 365
Lamb, Veal, and Game Products 345
Sweets 341
Pork Products 328
Fruits and Fruit Juices 328
#前十种最多的食品种类
step3
total=[]
number=range(0,len(data))
for i in number:
frame=pd.DataFrame(data[i]['nutrients'])
frame['id']=data[i]['id']
total.append(frame)
nutrients=pd.concat(total,ignore_index=True)
nutrients[:5]
description group units value id
0 Protein Composition g 25.18 1008
1 Total lipid (fat) Composition g 29.20 1008
2 Carbohydrate, by difference Composition g 3.06 1008
3 Ash Other g 3.28 1008
4 Energy Energy kcal 376.00 1008
#利用concat函数将所有元素的每个营养成分数据整合在一张DataFrame中
step4
nutrients.duplicated().sum()
14179
#求所有重复项数量
nutrients=nutrients.drop_duplicates()
#删除重复项
nutrients=nutrients.rename(columns={'description':'nutirent','group':'nutgroup',copy=False)
#对重复的列名进行重命名
nutrients[:5]
nutrient nutgroup units value id
0 Protein Composition g 25.18 1008
1 Total lipid (fat) Composition g 29.20 1008
2 Carbohydrate, by difference Composition g 3.06 1008
3 Ash Other g 3.28 1008
4 Energy Energy kcal 376.00 1008
step5
detail=pd.merge(data,nutrients,on='id',how='outer')
detail.head()
id description manufacturer group nutrient nutgroup units value
0 1008 Cheese, caraway Dairy and Egg Products Protein Composition g 25.18
1 1008 Cheese, caraway Dairy and Egg Products Total lipid (fat) Composition g 29.20
2 1008 Cheese, caraway Dairy and Egg Products Carbohydrate, by difference Composition g 3.06
3 1008 Cheese, caraway Dairy and Egg Products Ash Other g 3.28
4 1008 Cheese, caraway Dairy and Egg Products Energy Energy kcal 376.00
#将两个DataFrame根据id列进行外连接
step6
result=detail.groupby(['nutrient','group'])['value'].quantile(0.5)
result[:10]
nutrient group
Adjusted Protein Sweets 12.900
Vegetables and Vegetable Products 2.180
Alanine Baby Foods 0.085
Baked Products 0.248
Beef Products 1.550
Beverages 0.003
Breakfast Cereals 0.311
Cereal Grains and Pasta 0.373
Dairy and Egg Products 0.271
Ethnic Foods 1.290
result['Alanine'].sort_values(ascending=False)
group
Beef Products 1.5500
Lamb, Veal, and Game Products 1.4080
Poultry Products 1.3940
Pork Products 1.3070
Ethnic Foods 1.2900
Finfish and Shellfish Products 1.2180
Sausages and Luncheon Meats 0.9420
Nut and Seed Products 0.7345
Spices and Herbs 0.5550
Fast Foods 0.5140
Restaurant Foods 0.4650
Snacks 0.4335
Legumes and Legume Products 0.4100
Cereal Grains and Pasta 0.3730
Meals, Entrees, and Sidedishes 0.3270
Breakfast Cereals 0.3110
Dairy and Egg Products 0.2710
Baked Products 0.2480
Sweets 0.1020
Baby Foods 0.0850
Vegetables and Vegetable Products 0.0840
Soups, Sauces, and Gravies 0.0650
Fruits and Fruit Juices 0.0270
Beverages 0.0030
Fats and Oils 0.0000
step7
result['Alanine'].sort_values(ascending=False)[:10].plot()
#找到所有食物种类的中含Alanine成分最多的前十种
更多推荐
四大会计师事务所python数据分析
发布评论