Comparing two Excel files with different numbers of rows using Python Pandas

Updated: 2024-10-26 17:30:06

Problem description

I'm using Python 3.7, and I want to compare two Excel files that have the same columns (140 columns) but a different number of rows. I searched the site but didn't find a solution for my case.

Here is an example:

df1 (old report):

    id  qte  d1  d2
    A    10  23  35
    B    43  63  63
    C    15  61  62

df2 (new report):

    id  qte  d1  d2
    A    20  23  35
    C    15  61  62
    E    38  62  16
    F    63  20  51
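For anyone who wants to reproduce the example without the original Excel files, the two tables above can be built in memory. This is only a minimal sketch: the values are copied from the tables, and the assumption is that `id` is unique within each report.

```python
import pandas as pd

# Old report (df1) and new report (df2), copied from the example tables.
df1 = pd.DataFrame({
    "id": ["A", "B", "C"],
    "qte": [10, 43, 15],
    "d1": [23, 63, 61],
    "d2": [35, 63, 62],
})
df2 = pd.DataFrame({
    "id": ["A", "C", "E", "F"],
    "qte": [20, 15, 38, 63],
    "d1": [23, 61, 62, 20],
    "d2": [35, 62, 16, 51],
})

# Same columns, different number of rows.
print(df1.shape, df2.shape)
```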

The result should be:

Modified rows must be highlighted in yellow, with the modified values in red.

New rows in green.

Deleted rows in red.

    id  qte  d1  d2
    A    20  23  35
    C    15  61  62
    B    43  63  63
    E    38  62  16
    F    63  20  51

Code:

    import pandas as pd
    import numpy as np

    df1 = pd.read_excel(r'C .....\data novembre.xlsx', 'Sheet1', na_values=['NA'])
    df2 = pd.read_excel(r'C.....\data decembre.xlsx', 'Sheet1', na_values=['NA'])
    merged_data = df1.merge(df2, left_on='id', right_on='id', how='outer')

Joining the data, though, is not what I want!

I'm just starting to learn Python, so I really need help!

Recommended answer

An Excel diff can quickly become a funky beast, but we should be able to do this with some concats and boolean statements.

Assuming your dataframes are called df1 and df2:

    df1 = df1.set_index('id')
    df2 = df2.set_index('id')

    # Stack both reports, then collect the unique values per (id, column).
    df3 = pd.concat([df1, df2], sort=False)
    df3a = df3.stack().groupby(level=[0, 1]).unique().unstack(1).copy()

    df3a.loc[~df3a.index.isin(df2.index), 'status'] = 'deleted'  # if not in df2 index then deleted
    df3a.loc[~df3a.index.isin(df1.index), 'status'] = 'new'      # if not in df1 index then new

    idx = df3.stack().groupby(level=[0, 1]).nunique()  # get modified cells.
    df3a.loc[idx.mask(idx <= 1).dropna().index.get_level_values(0), 'status'] = 'modified'

    df3a['status'] = df3a['status'].fillna('same')  # assume that anything not fulfilled by the above rules is the same.

    print(df3a)

          d1    d2       qte    status
    id
    A   [23]  [35]  [10, 20]  modified
    B   [63]  [63]      [43]   deleted
    C   [61]  [62]      [15]      same
    E   [62]  [16]      [38]       new
    F   [20]  [51]      [63]       new
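As a point of comparison, the same status labels can also be derived from an outer merge with `indicator=True`, which is close to what the question started with. This is an alternative sketch, not the approach used in the answer above. The names `value_cols`, `_old` and `_new` are illustrative choices, and the example data is rebuilt from the question's tables. It assumes the compared cells contain no missing values, since NaN never compares equal to NaN.

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["A", "B", "C"], "qte": [10, 43, 15],
                    "d1": [23, 63, 61], "d2": [35, 63, 62]})
df2 = pd.DataFrame({"id": ["A", "C", "E", "F"], "qte": [20, 15, 38, 63],
                    "d1": [23, 61, 62, 20], "d2": [35, 62, 16, 51]})

value_cols = ["qte", "d1", "d2"]  # every column except the key

# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'.
merged = df1.merge(df2, on="id", how="outer",
                   suffixes=("_old", "_new"), indicator=True)

# Compare old and new values column by column.
old_vals = merged[[c + "_old" for c in value_cols]].to_numpy()
new_vals = merged[[c + "_new" for c in value_cols]].to_numpy()
changed = (old_vals != new_vals).any(axis=1)

# Map the merge indicator to a status, then upgrade 'same' to 'modified'
# for ids present in both files where at least one value differs.
status = merged["_merge"].astype(str).map(
    {"left_only": "deleted", "right_only": "new", "both": "same"}
)
status = status.mask((merged["_merge"] == "both") & changed, "modified")
merged["status"] = status

print(merged.set_index("id")["status"])
```

The merged frame keeps both the old and the new value of every column side by side, which may be convenient when deciding later which individual cells to highlight.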

If you don't mind the performance hit of turning all your datatypes into strings, then this could work. I don't recommend it, though: use a fact or slowly changing dimension schema to hold such data, and you'll thank yourself in the future.

    df3a.stack().explode().astype(str).groupby(level=[0,1]).agg('-->'.join).unstack(1)

        d1  d2      qte    status
    id
    A   23  35  10-->20  modified
    B   63  63       43   deleted
    C   61  62       15      same
    E   62  16       38       new
    F   20  51       63       new
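The answer stops at the status column and does not cover the coloring the question asks for. One possible way to get there is to build a table of CSS strings with the same shape as the report and hand it to the pandas Styler. The sketch below only builds that table. The `report` and `changed` frames are hypothetical inputs typed in by hand from the example, and it assumes "modified value in red" means red text on the yellow row background.

```python
import pandas as pd

# Hypothetical inputs: the final report with a status column, plus a boolean
# frame of the same shape marking which individual cells were modified.
report = pd.DataFrame(
    {
        "qte": [20, 15, 43, 38, 63],
        "d1": [23, 61, 63, 62, 20],
        "d2": [35, 62, 63, 16, 51],
        "status": ["modified", "same", "deleted", "new", "new"],
    },
    index=pd.Index(["A", "C", "B", "E", "F"], name="id"),
)
changed = pd.DataFrame(False, index=report.index, columns=report.columns)
changed.loc["A", "qte"] = True  # qte went from 10 to 20

ROW_COLORS = {"modified": "yellow", "new": "green", "deleted": "red"}


def build_styles(data, changed_cells):
    """Return a DataFrame of CSS strings with the same shape as data."""
    styles = pd.DataFrame("", index=data.index, columns=data.columns)
    # Color whole rows according to their status; 'same' rows stay unstyled.
    for status, color in ROW_COLORS.items():
        rows = data.index[data["status"] == status]
        styles.loc[rows, :] = f"background-color: {color}"
    # Modified cells keep the yellow background and get red text.
    return styles.where(~changed_cells, "background-color: yellow; color: red")


styles = build_styles(report, changed)
print(styles)
```

With jinja2 and openpyxl installed, something like `report.style.apply(lambda _: styles, axis=None).to_excel('diff.xlsx')` should carry these colors into a workbook; the file name here is only a placeholder.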


Published: 2023-05-27 23:12:03

Article link: https://www.elefans.com/category/jswz/34/306127.html