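Data Analysis, Lecture 6: pandas
Table of Contents
- Data Analysis, Lecture 6: pandas
- I. Introduction to pandas
- 1. Why learn pandas
- 2. What is pandas?
- II. Common pandas data types
- 1. Series: one-dimensional, labeled data
- 2. DataFrame: two-dimensional, a container of Series
- III. Creating a Series in pandas
- 1. Creating from a list
- 2. Creating with an explicit index
- 3. Creating from a dictionary
- 4. Creating from an ndarray
- IV. Series slicing and indexing
- 1. Slicing and indexing a Series
- 2. The index and values of a Series
- 3. Series arithmetic
- V. Reading external data with pandas
- Reading external data with pandas
- VI. pandas DataFrame
- 1. Creating a DataFrame
- 1.1 Like a multi-dimensional array, but each column can hold a different type; it has both a row index and a column index
- 1.2 A Series can be built from a dictionary. Can a DataFrame take a dictionary as its data as well?
- 1.3 A DataFrame has both a row index and a column index. What operations can we perform on it?
- 1.3.1 Basic DataFrame attributes
- 1.3.2 Getting an overview of a DataFrame
- 2. Exercise
- 3. Indexing in pandas
- 4. DataFrame arithmetic
- 5. Boolean indexing in pandas
- 6. Common pandas string methods
- 7. Sorting in pandas
- 8. Handling missing data in pandas
- 9. Handling duplicate data in pandas
- 10. Replacing data in pandas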
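I. Introduction to pandas
1. Why learn pandas
- NumPy already helps us process data, and together with matplotlib it covers many data analysis problems. So why learn pandas?
NumPy handles numeric data well, but that is not enough. Real data often contains strings, time series, and other non-numeric values alongside the numbers.
2. What is pandas?
- pandas is a tool built on NumPy that provides high-performance matrix operations; it was created to solve data analysis tasks.
pandas incorporates a large number of libraries and some standard data models, and provides the tools needed to work efficiently with large datasets.
- It was created in 2008, originally as a tool for financial data analysis.
- Installing pandas: pip install pandas (a mirror can be selected with -i <index URL> --trusted-host pypi.douban.com)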
II. Common pandas data types
1. Series: one-dimensional, labeled data
2. DataFrame: two-dimensional, a container of Series
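The description of a DataFrame as a container of Series can be checked directly. The following is a minimal sketch; the variable names and sample values are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration
names = pd.Series(["yangyu", "king"], name="name")
ages = pd.Series([18, 20], name="age")

# Build a DataFrame from two Series; each Series becomes one column
df = pd.concat([names, ages], axis=1)

# Selecting a single column gives back a Series that shares the row index
col = df["age"]
print(type(df).__name__)           # DataFrame
print(type(col).__name__)          # Series
print(col.index.equals(df.index))  # True
```

Each column keeps its own dtype, which is why a DataFrame can mix strings and numbers.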
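III. Creating a Series in pandas
1. Creating from a list
2. Creating with an explicit index
3. Creating from a dictionary
4. Creating from an ndarray
Previewing a Series
head() returns the first five rows by default
tail() returns the last five rows by default
The name attribute of a Series
pd.Series([1,3,5,4,55],index=list("abcde"),name='series')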
import pandas as pd # pip install pandas
import numpy as np

t = pd.Series([1,2,3,4,5]) # the default index starts at 0
print(t)
'''
0 1
1 2
2 3
3 4
4 5
dtype: int64'''
t1 = pd.Series([1,2,3,4,5],index=list("abcde")) # explicit index
print(t1)
'''
a 1
b 2
c 3
d 4
e 5
dtype: int64'''
dict1 = {'name':"yangyu",'age':18,'sex':'man'}
t3 = pd.Series(dict1)
print(t3)
'''
name yangyu
age 18
sex man
dtype: object'''
print(np.random.rand(5)) # five random floats in [0, 1)
'''
[0.52727399 0.58752451 0.25906087 0.9116376 0.28573861]'''
t4 = pd.Series(np.random.rand(5))
print(t4)
'''
0 0.084299
1 0.076995
2 0.946208
3 0.728213
4 0.506164
dtype: float64
'''
# Changing the dtype
print(t4.astype('int')) # float64 to int (the fractional part is truncated)
'''
0 0
1 0
2 0
3 0
4 0
dtype: int32'''
print(t.astype('float')) # int64 to float
'''
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
dtype: float64'''
# Previewing data
t5 = pd.Series(np.random.rand(100))
print(t5)
'''
0 0.008632
1 0.772691
2 0.422130
3 0.931042
4 0.467934
...
95 0.804642
96 0.410508
97 0.865550
98 0.279784
99 0.562883
Length: 100, dtype: float64'''
print(t5.head()) # previews the first 5 rows by default
'''
0 0.959057
1 0.279906
2 0.644710
3 0.628255
4 0.960321
dtype: float64'''
print(t5.tail()) # previews the last 5 rows by default
'''
95 0.266070
96 0.579535
97 0.457201
98 0.520111
99 0.276324
dtype: float64
'''
t6 = pd.Series(np.random.rand(3),index=list('abc'),name='t6')
print(t6)
'''
a 0.478545
b 0.983166
c 0.407203
Name: t6, dtype: float64'''
print(t6.name) # t6
t6.index.name = 'Series'
print(t6)
'''
Series
a 0.993078
b 0.959705
c 0.843601
Name: t6, dtype: float64'''
t7 = pd.Series([1,3,5,4,55],index=list("abcde"),name='series')
print(t7)
'''
a 1
b 3
c 5
d 4
e 55
Name: series, dtype: int64'''
IV. Series slicing and indexing
1. Slicing and indexing a Series
dict1 = {"name":"yangyu","age":18,"sex":'man'}
t1 = pd.Series(dict1)
1. Access by key (label)
2. Access by position
3. t1.index
4. t1.values
# Getting values
# 1. By key (label): t1['key'] t1.loc['key']
# 2. By position: t1[position] t1.iloc[position]
import pandas as pd

dict1 = {"name":"yangyu","age":18,"sex":'man'}
t1 = pd.Series(dict1)
print(t1)
'''
name yangyu
age 18
sex man
dtype: object'''
print(t1['age']) # 18
print(t1[1]) # 18 (integer key used as a position; newer pandas versions deprecate this, prefer t1.iloc[1])
print(t1.loc['age']) # 18
print(t1.iloc[1]) # 18
# Take the first two rows
print(t1[:2])
'''
name yangyu
age 18
dtype: object'''
# Take the 1st and 3rd rows
print(t1[[0,2]]) # by position (deprecated for label indexes in newer pandas; prefer iloc)
'''
name yangyu
sex man
dtype: object'''
print(t1.iloc[[0,2]]) # by position, using iloc
'''
name yangyu
sex man
dtype: object'''
print(t1[['name','sex']]) # by key (label)
'''
name yangyu
sex man
dtype: object'''
print(t1.values)
'''['yangyu' 18 'man']'''
print(t1.keys()) # keys() is a method and must be called; it returns the index
'''Index(['name', 'age', 'sex'], dtype='object')'''
print(t1.index)
'''Index(['name', 'age', 'sex'], dtype='object')'''
t2 = pd.Series([1,2,3,4,5])
print(t2)
'''
0 1
1 2
2 3
3 4
4 5
dtype: int64'''
print(t2>3)
'''
0 False
1 False
2 False
3 True
4 True
dtype: bool'''
print(t2[t2>3])
'''
3 4
4 5
dtype: int64'''
# `in` checks the keys (index labels), not the values
print('name' in t1)
'True'
print('yangyu' in t1)
'False'
# pandas handles missing data automatically, depending on the data type
data = ['a','b','c',None]
print(pd.Series(data))
'''
0 a
1 b
2 c
3 None
dtype: object
'''
data1 = [1,2,3,None]
print(pd.Series(data1))
'''
0 1.0
1 2.0
2 3.0
3 NaN
dtype: float64'''
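The two access styles and the key-versus-value membership rule from this section can be combined in one short sketch. It reuses the t1 data from above; the result variable names are assumptions for illustration.

```python
import pandas as pd

t1 = pd.Series({"name": "yangyu", "age": 18, "sex": "man"})

# Label-based and position-based access, spelled out explicitly
by_label = t1.loc["age"]   # look up the label 'age'
by_position = t1.iloc[1]   # look up the second element

# `in` tests the index labels; test the values explicitly when needed
label_found = "yangyu" in t1
value_found = "yangyu" in t1.values
# Alternative that stays inside pandas
value_found_isin = t1.isin(["yangyu"]).any()

print(by_label, by_position)                       # 18 18
print(label_found, value_found, value_found_isin)  # False True True
```

Using loc and iloc explicitly avoids any ambiguity about whether an integer key means a label or a position.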
2. The index and values of a Series
Given an unfamiliar Series, how do we find out its index and its values?
# The index and values of a Series
import pandas as pd

t = pd.Series([1,2,3,4,5],name='Series')
# Get the index
print(t.index) # RangeIndex(start=0, stop=5, step=1)
print(t)
'''
0 1
1 2
2 3
3 4
4 5
Name: Series, dtype: int64'''
t1 = pd.Series([1,2,3,4,5],index=list("abcde"),name='Series')
# Get the index
print(t1.index) # Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
# Iterate over the index with a for loop
for i in t1.index:
    print(i)
'''
a
b
c
d
e'''
print(type(t1.index)) # <class 'pandas.core.indexes.base.Index'>
print(t1.index[1]) # b
# t1.index[1] = 'f' # not allowed: an Index is immutable
# TypeError: Index does not support mutable operations
# Resetting the index
t1 = t1.reset_index()
print(t1)
'''index Series
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
'''
print(t1.index) # RangeIndex(start=0, stop=5, step=1)
t1 = t1.reset_index(drop=True)
print(t1.index) # RangeIndex(start=0, stop=5, step=1)
print(t1.values)
'''
[['a' 1]
 ['b' 2]
 ['c' 3]
 ['d' 4]
 ['e' 5]]
'''
print(type(t1.values)) # <class 'numpy.ndarray'>
3. Series arithmetic
t = pd.Series(range(10,20),index=range(10))
t1 = pd.Series(range(20,25),index=range(5))
t + t1
# Series arithmetic
import pandas as pd

t = pd.Series(range(10,20),index=range(10))
print(t)
'''
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int64'''
t1 = pd.Series(range(20,25),index=range(5))
print(t1)
'''
0 20
1 21
2 22
3 23
4 24
dtype: int64'''
print(t+t1) # values at matching index labels are added (the result is float); every other label gets NaN
'''
0 30.0
1 32.0
2 34.0
3 36.0
4 38.0
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64'''
t2 = pd.Series(range(10,20),index=range(0,20,2))
print(t2)
'''
0 10
2 11
4 12
6 13
8 14
10 15
12 16
14 17
16 18
18 19
dtype: int64
'''
print(t1)
'''
0 20
1 21
2 22
3 23
4 24
dtype: int64'''
print(t2+t1) # values at matching index labels are added (the result is float); every other label gets NaN
'''
0 30.0
1 NaN
2 33.0
3 NaN
4 36.0
6 NaN
8 NaN
10 NaN
12 NaN
14 NaN
16 NaN
18 NaN
dtype: float64'''
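When NaN for unmatched labels is not wanted, the add() method accepts a fill_value. The sketch below reuses t and t1 from above; treating a missing side as 0 is an assumption that suits sums but not every analysis.

```python
import pandas as pd

t = pd.Series(range(10, 20), index=range(10))
t1 = pd.Series(range(20, 25), index=range(5))

# Plain `+` yields NaN wherever a label exists in only one operand
plain = t + t1

# add() with fill_value=0 treats the missing side as 0 instead
filled = t.add(t1, fill_value=0)

print(plain.isna().sum())  # 5
print(filled.loc[0])       # 30.0
print(filled.loc[9])       # 19.0
```

The result is still float, because alignment introduces NaN before the fill value is applied.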
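V. Reading external data with pandas
Reading external data with pandas
Our data is stored in a CSV file, so we can read it directly with pd.read_csv.
# Reading external data with pandas
import pandas as pd

data = pd.read_csv('./catNames2.csv')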
print(data)
print(type(data)) # <class 'pandas.core.frame.DataFrame'>
'''Row_Labels Count_AnimalName
0 1 1
1 2 2
2 40804 1
3 90201 1
4 90203 1
... ... ...
16215 37916 1
16216 38282 1
16217 38583 1
16218 38948 1
16219 39743 1

[16220 rows x 2 columns]'''
data1 = pd.read_clipboard() # read data from the clipboard
print(data1)
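To try read_csv without the catNames2.csv file at hand, the text can be supplied through an in-memory buffer. This is a sketch under that assumption; the three sample rows are hypothetical and only mimic the file's two columns.

```python
import io
import pandas as pd

# Hypothetical sample with the same two columns as catNames2.csv
csv_text = """Row_Labels,Count_AnimalName
BELLA,1195
MAX,1153
,1
"""

# read_csv accepts any file-like object, not only a path
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                       # (3, 2)
print(df["Row_Labels"].isna().sum())  # 1, the empty name is read as NaN
print(df["Count_AnimalName"].dtype)   # int64
```

The empty field shows how missing names end up as NaN, which matters later when counting non-null values with info().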
VI. pandas DataFrame
1. Creating a DataFrame
1.1 Like a multi-dimensional array, but each column can hold a different type; it has both a row index and a column index
t = pd.DataFrame(np.arange(12).reshape(3,4))
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
How are DataFrame and Series related?
1.2 A Series can be built from a dictionary. Can a DataFrame take a dictionary as its data as well?
dict_data = {
'A': 1.,
'B': date(year=2019,month=8,day=29),
'C': pd.Series(1, index=list(range(4)), dtype='float32'),
'D': np.array([3] * 4, dtype='int32'),
'E': ['Python', 'Java', 'C++', 'C#'],
'F': 'ChinaHadoop'
}
# Creating a DataFrame
import pandas as pd
import numpy as np
from datetime import date

t = pd.DataFrame(np.arange(12).reshape(3,4))
print(t)
'''0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11'''
# Row index: index, axis 0 (axis=0)
# Column index: columns, axis 1 (axis=1)
t1 = pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('wxyz'))
print(t1)
'''w x y z
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11'''
# How are DataFrame and Series related? A DataFrame is a container of Series
dict_data = {
    'A': 1.,
    'B': date(year=2021,month=3,day=22),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': ['Python', 'Java', 'C++', 'C#'],
    'F': 'ChinaHadoop'
}
t2 = pd.DataFrame(dict_data)
print(t2)
'''A B C D E F
0 1.0 2021-03-22 1.0 3 Python ChinaHadoop
1 1.0 2021-03-22 1.0 3 Java ChinaHadoop
2 1.0 2021-03-22 1.0 3 C++ ChinaHadoop
3 1.0 2021-03-22 1.0 3 C# ChinaHadoop
'''
dict_data1 = {
    'A': 1.,
    'B': date(year=2021,month=3,day=22),
    'C': pd.Series(1, index=list(range(5)), dtype='float32'),
    'D': np.array([3] * 5, dtype='int32'),
    'E': ['Python', 'Java', 'C++', 'C#', 'PHP'],
    'F': 'ChinaHadoop'
}
t3 = pd.DataFrame(dict_data1)
print(t3)
'''A B C D E F
0 1.0 2021-03-22 1.0 3 Python ChinaHadoop
1 1.0 2021-03-22 1.0 3 Java ChinaHadoop
2 1.0 2021-03-22 1.0 3 C++ ChinaHadoop
3 1.0 2021-03-22 1.0 3 C# ChinaHadoop
4 1.0 2021-03-22 1.0 3 PHP ChinaHadoop'''
dict3 = {'name':['yangyu','king'],'age':[18,20],'address':['shanghai','chengdu']}
t4 = pd.DataFrame(dict3)
print(t4)
'''name age address
0 yangyu 18 shanghai
1 king 20 chengdu'''
print(type(t4)) # <class 'pandas.core.frame.DataFrame'>
dict4 = [{'name':'yangyu','age':18,'tel':10000},{'name':'king','tel':10001},{'name':'lilei','age':20}]
t5 = pd.DataFrame(dict4) # missing entries are filled with NaN automatically
print(t5)
'''name age tel
0 yangyu 18.0 10000.0
1 king NaN 10001.0
2 lilei 20.0 NaN
'''
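1.3 A DataFrame has both a row index and a column index. What operations can we perform on it?
1.3.1 Basic DataFrame attributes
df.shape # (number of rows, number of columns)
df.dtypes # dtype of each column
df.ndim # number of dimensions
df.index # row index
df.columns # column index
df.values # the values, as a 2-D ndarray
df.drop(columns=['name','age']) # returns a new DataFrame without those columns; the original is unchanged
del df['name'] # deletes the column in place
# DataFrame operations
import pandas as pd

dict1 = [{'name': 'yangyu', 'age': 18, 'tel': 10000}, {'name': 'king', 'tel': 10001}, {'name': 'lilei', 'age': 20}]
t1 = pd.DataFrame(dict1) # missing entries are filled with NaN automatically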
print(t1)
'''name age tel
0 yangyu 18.0 10000.0
1 king NaN 10001.0
2 lilei 20.0 NaN'''
print(type(t1)) # <class 'pandas.core.frame.DataFrame'>
print(t1.shape) # (3, 3): rows, columns
print(t1.dtypes) # dtype of each column
'''
name object
age float64
tel float64
dtype: object'''
print(t1.ndim) # number of dimensions: 2
print(t1.index) # row index: RangeIndex(start=0, stop=3, step=1)
t2 = pd.DataFrame(dict1,index=list('abc'))
print(t2.index) # row index: Index(['a', 'b', 'c'], dtype='object')
print(t1.columns) # column index: Index(['name', 'age', 'tel'], dtype='object')
print(t1.values)
'''
[['yangyu' 18.0 10000.0]
 ['king' nan 10001.0]
 ['lilei' 20.0 nan]]'''
print(type(t1.values)) # <class 'numpy.ndarray'>
print(t2.drop(index='a'))
'''name age tel
b king NaN 10001.0
c lilei 20.0 NaN'''
print(t2.drop(index='a',columns='age'))
'''name tel
b king 10001.0
c lilei NaN'''
print(t2) # the original is unchanged
'''name age tel
a yangyu 18.0 10000.0
b king NaN 10001.0
c lilei 20.0 NaN'''
# inplace=True modifies the object itself
t2.drop(index='a',columns='age',inplace=True)
print(t2)
'''name tel
b king 10001.0
c lilei NaN'''
del t1['name'] # delete an entire column
print(t1)
'''age tel
0 18.0 10000.0
1 NaN 10001.0
2 20.0 NaN'''
del t1 # removes the variable t1 itself
print(t1) # NameError: name 't1' is not defined
1.3.2 Getting an overview of a DataFrame
t2.head(3) # show the first rows (5 by default)
t2.tail(3) # show the last rows (5 by default)
t2.info() # summary of the DataFrame
t2.describe() # quick summary statistics
# Getting an overview of a DataFrame
import pandas as pd

dict1 = [
    {'name': 'yangyu', 'age': 18, 'tel': 10000},
    {'name': 'king', 'tel': 10001},
    {'name': 'lilei', 'age': 20},
    {'name': 'yangyu', 'age': 18, 'tel': 10000},
    {'name': 'yangyu', 'age': 18, 'tel': 10000},
    {'name': 'yangyu', 'age': 18, 'tel': 10000},
]
t1 = pd.DataFrame(dict1) # missing entries are filled with NaN automatically
print(t1)
'''name age tel
0 yangyu 18.0 10000.0
1 king NaN 10001.0
2 lilei 20.0 NaN
3 yangyu 18.0 10000.0
4 yangyu 18.0 10000.0
5 yangyu 18.0 10000.0
'''
print(t1.head()) # previews the first 5 rows by default
'''name age tel
0 yangyu 18.0 10000.0
1 king NaN 10001.0
2 lilei 20.0 NaN
3 yangyu 18.0 10000.0
4 yangyu 18.0 10000.0'''
print(t1.tail()) # previews the last 5 rows by default
'''name age tel
1 king NaN 10001.0
2 lilei 20.0 NaN
3 yangyu 18.0 10000.0
4 yangyu 18.0 10000.0
5 yangyu 18.0 10000.0'''
print(t1.info()) # summary; info() prints the report itself and returns None
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    6 non-null      object
 1   age     5 non-null      float64
 2   tel     5 non-null      float64
dtypes: float64(2), object(1)
memory usage: 272.0+ bytes
None'''
print(t1.describe()) # summary statistics
'''age tel
count 5.000000 5.000000
mean 18.400000 10000.200000
std 0.894427 0.447214
min 18.000000 10000.000000
25% 18.000000 10000.000000
50% 18.000000 10000.000000
75% 18.000000 10000.000000
max 20.000000 10001.000000
'''
2. Exercise
Suppose we have statistics on cat names and want to know which names are used most often.
df.sort_values(by="",ascending=False)
# Suppose we have statistics on cat names and want to know which names are used most often.
import pandas as pd

# Sorting
df = pd.read_csv('catNames2.csv')
# print(df)
# Ascending by default; `by` names the column to sort on
# print(df.sort_values(by='Count_AnimalName')) # ascending, default ascending=True
# print(df.sort_values(by='Count_AnimalName',ascending=False)) # descending
df.sort_values(by='Count_AnimalName',ascending=False,inplace=True) # modifies df in place
print(df.info())
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16220 entries, 1156 to 16219
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Row_Labels        16217 non-null  object
 1   Count_AnimalName  16220 non-null  int64
dtypes: int64(1), object(1)
memory usage: 380.2+ KB
None
'''
print(df)
'''Row_Labels Count_AnimalName
1156 BELLA 1195
9140 MAX 1153
2660 CHARLIE 856
3251 COCO 852
12368 ROCKY 823
... ... ...
6884 J-LO 1
6888 JOANN 1
6890 JOAO 1
6891 JOAQUIN 1
16219 39743 1

[16220 rows x 2 columns]'''
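If only the top few names are needed, nlargest gives the same rows without sorting the whole table in place. The sketch below uses a small hypothetical stand-in for catNames2.csv, with counts copied from the output above.

```python
import pandas as pd

# Small hypothetical stand-in for catNames2.csv
df = pd.DataFrame({
    "Row_Labels": ["COCO", "BELLA", "JOANN", "MAX", "CHARLIE"],
    "Count_AnimalName": [852, 1195, 1, 1153, 856],
})

# nlargest(n, column) returns the n rows with the largest values, already sorted
top3 = df.nlargest(3, "Count_AnimalName")

# Equivalent result via sort_values + head
top3_sorted = df.sort_values(by="Count_AnimalName", ascending=False).head(3)

print(top3["Row_Labels"].tolist())  # ['BELLA', 'MAX', 'CHARLIE']
```

Both variants leave df untouched, unlike sort_values with inplace=True.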
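3. Indexing in pandas
- Selecting rows and columns
A numeric slice in square brackets (for example df[:2]) selects rows
A string in square brackets selects a column by its label
import pandas as pd

# Sorting
df = pd.read_csv('catNames2.csv')
# print(df)
# Ascending by default; `by` names the column to sort on
# print(df.sort_values(by='Count_AnimalName')) # ascending, default ascending=True
# print(df.sort_values(by='Count_AnimalName',ascending=False)) # descending
df.sort_values(by='Count_AnimalName',ascending=False,inplace=True) # modifies df in place
# Select rows: the first 2 rows
print(df[:2]) # or print(df.head(2))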
'''Row_Labels Count_AnimalName
1156 BELLA 1195
9140 MAX 1153'''
# Select a column by its label
print(df['Count_AnimalName'])
'''
1156 1195
9140 1153
2660 856
3251 852
12368 823
...
6884 1
6888 1
6890 1
6891 1
16219 1
'''
# Select rows and a column at the same time
print(df[:2]['Row_Labels'])
'''
1156 BELLA
9140 MAX
Name: Row_Labels, dtype: object'''
print(type(df[:2]['Row_Labels'])) # <class 'pandas.core.series.Series'>
- pandas also provides more optimized selection methods
df.loc selects rows by label
df.iloc selects rows by position
# pandas also provides more optimized selection methods
# df.loc selects rows by label
# df.iloc selects rows by position
import pandas as pd
import numpy as np

t = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('abc'), columns=list('wxyz'))
print(t)
'''w x y z
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11'''
# Select one row
print(t.loc['a'])
print(type(t.loc['a']))
'''
w 0
x 1
y 2
z 3
Name: a, dtype: int32
<class 'pandas.core.series.Series'>'''
# Select one column
print(t.loc[:, 'z'])
print(type(t.loc[:, 'z']))
'''
a 3
b 7
c 11
Name: z, dtype: int32
<class 'pandas.core.series.Series'>'''
# Select several rows by label
print(t.loc[['a', 'c']])
print(type(t.loc[['a', 'c']]))
'''w x y z
a 0 1 2 3
c 8 9 10 11
<class 'pandas.core.frame.DataFrame'>'''
# Select several rows by position
print(t.iloc[[0, 2]])
print(type(t.iloc[[0, 2]]))
'''w x y z
a 0 1 2 3
c 8 9 10 11
<class 'pandas.core.frame.DataFrame'>'''
# Select several columns by label
print(t.loc[:, ['w', 'z']])
print(type(t.loc[:, ['w', 'z']]))
'''w z
a 0 3
b 4 7
c 8 11
<class 'pandas.core.frame.DataFrame'>'''
# Select several columns by position
print(t.iloc[:, [0, 3]])
print(type(t.iloc[:, [0, 3]]))
'''w z
a 0 3
b 4 7
c 8 11
<class 'pandas.core.frame.DataFrame'>'''
# Select a single value
print(t.iloc[0, 0]) # 0
t.iloc[0, 0] = 100
print(t)
'''w x y z
a 100 1 2 3
b 4 5 6 7
c 8 9 10 11'''
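t.iloc[0, 0] = np.nan # no manual conversion to float is needed; column w is upcast automatically (newer pandas versions may warn about this implicit upcast)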
print(t)
'''w x y z
a NaN 1 2 3
b 4.0 5 6 7
c 8.0 9 10 11'''
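One difference between loc and iloc that the examples above do not show is how slices behave. The following sketch uses the same 3x4 frame; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("wxyz"))

# Label slices include both endpoints
by_label = t.loc["a":"b", "w":"y"]

# Position slices exclude the stop position, like Python lists
by_position = t.iloc[0:2, 0:3]

print(by_label.shape)                # (2, 3)
print(by_position.shape)             # (2, 3)
print(by_label.equals(by_position))  # True
```

Here "a":"b" and 0:2 select the same two rows, because the label slice keeps its end label while the position slice drops its end position.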
4. DataFrame arithmetic
t = pd.DataFrame(np.ones((2,2)),columns=['a','b'])
t1 = pd.DataFrame(np.ones((3,3)),columns=['a','b','c'])
t2 = pd.Series(range(20,25),index=range(5))
# DataFrame arithmetic
import pandas as pd
import numpy as np

t = pd.DataFrame(np.ones((2,2)),columns=['a','b'])
t1 = pd.DataFrame(np.ones((3,3)),columns=['a','b','c'])
print(t)
'''a b
0 1.0 1.0
1 1.0 1.0'''
print(t1)
'''a b c
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0'''
print(t + t1)
'''a b c
0 2.0 2.0 NaN
1 2.0 2.0 NaN
2 NaN NaN NaN'''
t2 = pd.Series(range(20,25),index=range(5))
print(t2)
'''
0 20
1 21
2 22
3 23
4 24
dtype: int64'''
print(t)
'''a b
0 1.0 1.0
1 1.0 1.0'''
print(t2 + t)
'''a b 0 1 2 3 4
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN'''
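The all-NaN result above arises because `+` matches the Series labels (0 to 4) against the column labels ('a', 'b'). A sketch of the row-wise alternative, assuming the intent was to add t2 down the rows:

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.ones((2, 2)), columns=["a", "b"])
t2 = pd.Series(range(20, 25), index=range(5))

# add(..., axis=0) matches the Series labels against the row index instead
by_rows = t.add(t2, axis=0)

print(by_rows.loc[0, "a"])  # 21.0
print(by_rows.loc[1, "b"])  # 22.0
print(by_rows.shape)        # (5, 2)
```

Rows 2 to 4 exist only in t2, so they are still NaN in the result; only the overlapping rows 0 and 1 are summed.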
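5. Boolean indexing in pandas
Suppose we want all cat names used more than 800 times. How do we select them?
Suppose we want all names used more than 700 times whose string length is greater than 4. How do we select them?
# Boolean indexing in pandas
import pandas as pd
import numpy as np
# Suppose we want all cat names used more than 800 times. How do we select them?
df = pd.read_csv('catNames2.csv')
print(df[df['Count_AnimalName']>800])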
'''Row_Labels Count_AnimalName
1156 BELLA 1195
2660 CHARLIE 856
3251 COCO 852
9140 MAX 1153
12368 ROCKY 823'''
# Names used more than 700 but fewer than 1000 times; each condition needs its own parentheses, joined with &
print(df[(df['Count_AnimalName']>700) & (df['Count_AnimalName']<1000)])
'''Row_Labels Count_AnimalName
2660 CHARLIE 856
3251 COCO 852
8417 LOLA 795
8552 LUCKY 723
8560 LUCY 710
12368 ROCKY 823'''
# Names used more than 700 times whose string length is greater than 4
print(df[(df['Count_AnimalName']>700) & (df['Row_Labels'].str.len()>4)])
'''Row_Labels Count_AnimalName
1156 BELLA 1195
2660 CHARLIE 856
8552 LUCKY 723
12368 ROCKY 823'''
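6. Common pandas string methods
Method: Description
- contains: returns a boolean array indicating whether each string contains the given pattern
- lower, upper: convert case
- slice: extract a substring from each string in the Series
- split: split each string on a delimiter or regular expression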
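These methods are called through the .str accessor of a Series. The sketch below applies each of them to a few names taken from the cat-name output; the variable names are assumptions for illustration.

```python
import pandas as pd

names = pd.Series(["BELLA", "MAX", "CHARLIE", "J-LO"])

has_l = names.str.contains("L")  # boolean Series
lower = names.str.lower()        # lower-case copy
first3 = names.str.slice(0, 3)   # first three characters
parts = names.str.split("-")     # list of pieces per element

print(has_l.tolist())   # [True, False, True, True]
print(lower.tolist())   # ['bella', 'max', 'charlie', 'j-lo']
print(first3.tolist())  # ['BEL', 'MAX', 'CHA', 'J-L']
print(parts.tolist())
```

The boolean result of contains can be used directly as a mask, in the same way as the comparisons in the boolean indexing section.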
7. Sorting in pandas
Sort by index: sort_index()
Sort by values: sort_values(by, ascending)
Sort by the values of a single column
by='label'
ascending: True for ascending, False for descending
# Sorting in pandas
import pandas as pd
import numpy as np

t = pd.DataFrame(np.arange(12).reshape(3,4))
print(t)
'''0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
'''
print(t.sort_index())
'''0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11'''
print(t.sort_index(ascending=False)) # index in descending order
'''0 1 2 3
2 8 9 10 11
1 4 5 6 7
0 0 1 2 3'''
t = pd.DataFrame(np.arange(12).reshape(3,4), index=list('abc'))
print(t.sort_index(ascending=False)) # index descending; strings compare by code point (a=97, b=98, c=99, A=65, B=66)
'''0 1 2 3
c 8 9 10 11
b 4 5 6 7
a 0 1 2 3'''
t = pd.DataFrame(np.arange(12).reshape(3,4), index=list('aBc'))
print(t.sort_index(ascending=False)) # index in descending order
'''0 1 2 3
c 8 9 10 11
a 0 1 2 3
B 4 5 6 7'''
print(t.sort_index(ascending=False, axis=1)) # column labels (axis 1) in descending order
'''3 2 1 0
a 3 2 1 0
B 7 6 5 4
c 11 10 9 8'''
print(t.sort_values(by=0)) # sort by column 0, ascending
'''0 1 2 3
a 0 1 2 3
B 4 5 6 7
c 8 9 10 11'''
print(t.sort_values(by=0,ascending=False)) # sort by column 0, descending
'''0 1 2 3
c 8 9 10 11
B 4 5 6 7
a 0 1 2 3'''
t = pd.DataFrame(np.arange(12).reshape(3,4), index=list('aBc'),columns=list('efgh'))
print(t.sort_values(by='e',ascending=False)) # sort by column 'e', descending
'''e f g h
c 8 9 10 11
B 4 5 6 7
a 0 1 2 3'''
print(t.sort_values(by='a',ascending=False, axis=1)) # sort the columns by the values in row 'a', descending
'''h g f e
a 3 2 1 0
B 7 6 5 4
c 11 10 9 8'''
# Sorting in pandas
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
                   'col2': [2, 1, 9, 8, 7, 7],
                   'col3': [0, 1, 5, 4, 8, 2]})
print(df)
'''col1 col2 col3
0 A 2 0
1 A 1 1
2 B 9 5
3 NaN 8 4
4 D 7 8
5 C 7 2'''
print(df.sort_values(by=['col2'],ascending=False)) # sort by col2 in descending order; the row with the largest col2 comes first
'''col1 col2 col3
2 B 9 5
3 NaN 8 4
4 D 7 8
5 C 7 2
0 A 2 0
1 A 1 1'''
print(df.sort_values(by=['col2','col3'])) # sort ascending by col2, then by col3; rows with equal col2 are ordered by col3
'''col1 col2 col3
1 A 1 1
0 A 2 0
5 C 7 2
4 D 7 8
3 NaN 8 4
2 B 9 5'''
8. Handling missing data in pandas
Missing data usually comes in two forms:
One is an empty value such as None, which pandas represents as NaN (the same as np.nan)
The other is a value that we have set to 0
- Handling missing data in pandas
Check for NaN: pd.isnull(df), pd.notnull(df)
Option 1: drop the rows or columns that contain NaN: dropna(axis=0, how='any', inplace=False)
Option 2: fill in values: t.fillna(t.mean()), t.fillna(t.median()), t.fillna(0)
# Handling missing data in pandas
'''
Check for NaN: pd.isnull(df), pd.notnull(df)
Option 1: drop the rows or columns that contain NaN: t.dropna(axis=0, how='any', inplace=False)
Option 2: fill in values: t.fillna(t.mean()), t.fillna(t.median()), t.fillna(0)
'''
import pandas as pd
import numpy as np

t = pd.DataFrame(np.arange(24).reshape(4,6),dtype=float,index=list('ABCD'),columns=list('UVWXYZ'))
print(t)
'''U V W X Y Z
A 0.0 1.0 2.0 3.0 4.0 5.0
B 6.0 7.0 8.0 9.0 10.0 11.0
C 12.0 13.0 14.0 15.0 16.0 17.0
D 18.0 19.0 20.0 21.0 22.0 23.0'''
t.iloc[0,0] = np.nan
t.iloc[0,5] = np.nan
t.iloc[3,2] = np.nan
t.iloc[1,4] = 0
print(t)
'''U V W X Y Z
A NaN 1.0 2.0 3.0 4.0 NaN
B 6.0 7.0 8.0 9.0 0.0 11.0
C 12.0 13.0 14.0 15.0 16.0 17.0
D 18.0 19.0 NaN 21.0 22.0 23.0'''
print(pd.isnull(t))
'''U V W X Y Z
A True False False False False True
B False False False False False False
C False False False False False False
D False False True False False False'''
print(pd.notnull(t))
'''U V W X Y Z
A False True True True True False
B True True True True True True
C True True True True True True
D True True False True True True'''
# how="all" drops a row or column only when every value in it is NaN
print(t.dropna()) # defaults: axis=0, how="any", thresh=None, subset=None, inplace=False; drops the rows that contain NaN
'''U V W X Y Z
B 6.0 7.0 8.0 9.0 0.0 11.0
C 12.0 13.0 14.0 15.0 16.0 17.0'''
print(t.dropna(axis=1)) # drops the columns that contain NaN
'''V X Y
A 1.0 3.0 4.0
B 7.0 9.0 0.0
C 13.0 15.0 16.0
D 19.0 21.0 22.0'''
print(t)
'''U V W X Y Z
A NaN 1.0 2.0 3.0 4.0 NaN
B 6.0 7.0 8.0 9.0 0.0 11.0
C 12.0 13.0 14.0 15.0 16.0 17.0
D 18.0 19.0 NaN 21.0 22.0 23.0'''
print(t.fillna(2)) # fill every NaN with the number 2
'''U V W X Y Z
A 2.0 1.0 2.0 3.0 4.0 2.0
B 6.0 7.0 8.0 9.0 0.0 11.0
C 12.0 13.0 14.0 15.0 16.0 17.0
D 18.0 19.0 2.0 21.0 22.0 23.0'''
print(t.fillna(t.mean())) # fill every NaN with the mean of its own column
'''U V W X Y Z
A 12.0 1.0 2.0 3.0 4.0 17.0
B 6.0 7.0 8.0 9.0 0.0 11.0
C 12.0 13.0 14.0 15.0 16.0 17.0
D 18.0 19.0 8.0 21.0 22.0 23.0'''
print("="*30)
print(t)
'''U V W X Y Z
A NaN 1.0 2.0 3.0 4.0 NaN
B 6.0 7.0 8.0 9.0 0.0 11.0
C 12.0 13.0 14.0 15.0 16.0 17.0
D 18.0 19.0 NaN 21.0 22.0 23.0'''
t['Z'] = t['Z'].fillna(t['Z'].mean()) # fill a single column with its own mean
print(t)
''' U V W X Y Z
A NaN 1.0 2.0 3.0 4.0 17.0
B 6.0 7.0 8.0 9.0 0.0 11.0
C 12.0 13.0 14.0 15.0 16.0 17.0
D 18.0 19.0 NaN 21.0 22.0 23.0'''
t['W'] = t['W'].fillna(t['X'].mean()) # fill a single column with the mean of another column
print("="*30)
print(t)
''' U V W X Y Z
A NaN 1.0 2.0 3.0 4.0 17.0
B 6.0 7.0 8.0 9.0 0.0 11.0
C 12.0 13.0 14.0 15.0 16.0 17.0
D 18.0 19.0 12.0 21.0 22.0 23.0'''
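The how and thresh parameters of dropna and the median variant of fillna are mentioned above but not demonstrated. The following sketch rebuilds the same frame with its three NaN cells and tries each of them.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(24).reshape(4, 6), dtype=float,
                 index=list("ABCD"), columns=list("UVWXYZ"))
t.iloc[0, 0] = np.nan
t.iloc[0, 5] = np.nan
t.iloc[3, 2] = np.nan

# Count missing values per column
print(t.isna().sum().tolist())            # [1, 0, 1, 0, 0, 1]

# how='all' only drops rows that are entirely NaN, so nothing is dropped here
print(t.dropna(how="all").shape)          # (4, 6)

# thresh=5 keeps rows with at least 5 non-NaN values; row A has only 4
print(t.dropna(thresh=5).index.tolist())  # ['B', 'C', 'D']

# Fill each column with its own median instead of its mean
filled = t.fillna(t.median())
print(filled.loc["A", "U"])               # 12.0
```

For column U the median and the mean happen to coincide (12.0); on skewed data the median is less affected by outliers.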
9. Handling duplicate data in pandas
data = pd.DataFrame(
{
'age':[28,31,27,28],
'gender':['M','M','M','F'],
'name':['Liu','Li','Chen','Liu']
}
)
Check for duplicate rows
data.duplicated()
Remove duplicate rows
data.drop_duplicates()
subset: compare only the given columns
keep: which occurrence to keep; the default 'first' keeps the first one
# Handling duplicate data in pandas
'''
Check for duplicate rows: data.duplicated()
Remove duplicate rows: data.drop_duplicates(subset=None, keep='first', inplace=False)
'''
import pandas as pd
import numpy as np

data = pd.DataFrame({'age': [28, 31, 27, 28],
                     'gender': ['M', 'M', 'M', 'F'],
                     'name': ['Liu', 'Li', 'Chen', 'Liu']})
print(data)
'''age gender name
0 28 M Liu
1 31 M Li
2 27 M Chen
3 28 F Liu'''
# Check for duplicate rows with data.duplicated()
print(data.duplicated()) # checks whether an entire row is duplicated
'''
0 False
1 False
2 False
3 False
dtype: bool'''
data1 = pd.DataFrame({'age': [28, 31, 27, 28],'gender': ['M', 'M', 'M', 'M'],'name': ['Liu', 'Li', 'Chen', 'Liu']}
)
print(data1.duplicated()) # checks whether an entire row is duplicated
'''
0 False
1 False
2 False
3 True
dtype: bool'''
# Check for duplicates using only age and name
print(data.duplicated(subset=['age','name']))
'''
0 False
1 False
2 False
3 True
dtype: bool'''
data.drop_duplicates(inplace=True) # no complete row is duplicated, so nothing is removed
print(data)
'''age gender name
0 28 M Liu
1 31 M Li
2 27 M Chen
3 28 F Liu'''
data.drop_duplicates(subset=['age','name'],inplace=True) # remove rows that repeat on the given columns
print(data)
'''age gender name
0 28 M Liu
1 31 M Li
2 27 M Chen'''
data1.drop_duplicates(inplace=True) # remove rows that are duplicated in full
print(data1)
'''age gender name
0 28 M Liu
1 31 M Li
2 27 M Chen'''
print("=="*30)
data2 = pd.DataFrame({'age': [28, 31, 27, 28],'gender': ['M', 'M', 'M', 'M'],'name': ['Liu', 'Li', 'Chen', 'Liu']}
)
data2.drop_duplicates(inplace=True,keep='last') # drops the earlier duplicates and keeps the last one; the default keep='first' keeps the first
print(data2)
'''age gender name
1 31 M Li
2 27 M Chen
3 28 M Liu'''
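Besides 'first' and 'last', keep also accepts False, which treats every member of a duplicate group as a duplicate. A minimal sketch on the same data:

```python
import pandas as pd

data = pd.DataFrame({
    "age": [28, 31, 27, 28],
    "gender": ["M", "M", "M", "F"],
    "name": ["Liu", "Li", "Chen", "Liu"],
})

# keep=False flags every member of a duplicate group, not only the repeats
mask = data.duplicated(subset=["age", "name"], keep=False)
print(mask.tolist())  # [True, False, False, True]

# Dropping with keep=False removes all rows that have a duplicate
unique_only = data.drop_duplicates(subset=["age", "name"], keep=False)
print(unique_only["name"].tolist())  # ['Li', 'Chen']
```

This is useful for inspecting conflicting records, such as the two 'Liu' rows that differ only in gender, before deciding which one to keep.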
10. Replacing data in pandas
replace(to_replace)
to_replace is the value to be replaced; it can be
a number or a string
a list
a dictionary
# Replacing data in pandas
'''
replace(to_replace)
to_replace is the value to be replaced; it can be
a number or a string
a list
a dictionary
'''
import pandas as pd
import numpy as np

data = pd.DataFrame({'age': [28, 31, 27, 28],
                     'gender': ['M', 'M', 'M', 'F'],
                     'name': ['Liu', 'Li', 'Chen', 'Liu']})
print(data.replace(28, 30)) # replace every 28 with 30
'''age gender name
0 30 M Liu
1 31 M Li
2 27 M Chen
3 30 F Liu
'''
print(data.replace('M', 30)) # replace every 'M' with 30
'''age gender name
0 28 30 Liu
1 31 30 Li
2 27 30 Chen
3 28 F Liu'''
print(data.replace(['M', 'F'], 30)) # replace both 'M' and 'F' with 30
'''age gender name
0 28 30 Liu
1 31 30 Li
2 27 30 Chen
3 28 30 Liu'''
print(data.replace(['M', 'F'], ['MAN', 'FAN'])) # replace 'M' with 'MAN' and 'F' with 'FAN'
'''age gender name
0 28 MAN Liu
1 31 MAN Li
2 27 MAN Chen
3 28 FAN Liu'''
print(data.replace({28: 30, 'M': 'MAN'})) # with a dictionary: replace 28 with 30 and 'M' with 'MAN'
'''age gender name
0 30 MAN Liu
1 31 MAN Li
2 27 MAN Chen
3 30 F Liu'''
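The dictionary form above replaces matching values in every column. To restrict a replacement to one column, replace also accepts a nested dictionary; the sketch below assumes only the gender column should change, and the 'WOMAN' label is a hypothetical choice.

```python
import pandas as pd

data = pd.DataFrame({
    "age": [28, 31, 27, 28],
    "gender": ["M", "M", "M", "F"],
    "name": ["Liu", "Li", "Chen", "Liu"],
})

# A nested dict limits the replacement to one column:
# {column: {old_value: new_value}}
scoped = data.replace({"gender": {"M": "MAN", "F": "WOMAN"}})
print(scoped["gender"].tolist())  # ['MAN', 'MAN', 'MAN', 'WOMAN']

# replace() returns a new DataFrame; the original is unchanged
print(data["gender"].tolist())    # ['M', 'M', 'M', 'F']
```

Scoping by column avoids accidental matches, for example a name that happens to equal 'M'.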