Python pandas DataFrame comparison
People of Stack Overflow, help!
I have a LeetCode-style problem for you.
Imagine a scenario where you have two 2D arrays, more specifically two pandas DataFrames.
I need to compare these two DataFrames and highlight all the differences. However, there is a catch: rows can be missing from either DataFrame, and individual cells can be missing too, which makes this a lot more difficult. I'll provide an example.
import numpy as np
import pandas as pd
x = [[0, 1, 2, 3],[4, 5, 6, 7],[8, 9, 10, 11],[12, 13, 14, 15]]
y = [[np.nan, 1, 2, 3],[4, 5, 6, np.nan],[12, 13, 14, 15]]
df1 = pd.DataFrame(x)
df2 = pd.DataFrame(y)
How can I identify all of the missing cells AND the missing rows?
Bonus points if you can create code to highlight the differences and export them to an Excel sheet ;)
Stage 1
A good starting point would be the following Stack Overflow answer: https://stackoverflow.com/a/48647840/15965988
This would remove rows that are 100% identical from both tables.
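As a minimal sketch of Stage 1 (one possible way to do it, not necessarily the approach in the linked answer), the two frames can be stacked and every row that appears more than once dropped. The names `combined`, `df1_rest` and `df2_rest` are illustrative. Note the assumption that neither frame contains internal duplicates, because `drop_duplicates(keep=False)` would remove those as well.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]])
df2 = pd.DataFrame([[np.nan, 1, 2, 3], [4, 5, 6, np.nan], [12, 13, 14, 15]])

# Stack both frames, tagging each row with the frame it came from
combined = pd.concat([df1, df2], keys=["df1", "df2"])

# keep=False drops every copy of a duplicated row, so rows that are
# identical in both frames disappear from both sides
only_different = combined.drop_duplicates(keep=False)

# Split the survivors back into their original frames
source = only_different.index.get_level_values(0)
df1_rest = only_different[source == "df1"].droplevel(0)
df2_rest = only_different[source == "df2"].droplevel(0)

print(df1_rest.index.tolist())  # rows of df1 with no exact twin in df2
print(df2_rest.index.tolist())  # rows of df2 with no exact twin in df1
```

With the example data, the row `[12, 13, 14, 15]` exists in both frames and is removed from each side.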
Stage 2
At this stage, only rows with differences remain. From here I'd recommend looping over each row. For each row, you'll need some logic that queries the other dataframe for a similar row. For that query, consider matching on only some of the columns.
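A rough sketch of that matching logic, assuming "similar" means "agrees on the largest number of columns". The helper `best_match` and its `min_matches` threshold are hypothetical names introduced here for illustration, and the example frames are the rows that would remain after Stage 1.

```python
import numpy as np
import pandas as pd

def best_match(row, other, min_matches=2):
    """Return the label of the row in `other` that agrees with `row` on the
    most columns, or None if no row agrees on at least `min_matches` columns."""
    # Compare `row` against every row of `other`; NaN never counts as equal
    matches = other.eq(row, axis="columns").sum(axis=1)
    if matches.empty or matches.max() < min_matches:
        return None
    return matches.idxmax()

# Rows left over once exact duplicates have been removed
df1_rest = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])
df2_rest = pd.DataFrame([[np.nan, 1, 2, 3], [4, 5, 6, np.nan]])

# Map each remaining df1 row to its closest df2 row (None = missing row)
pairs = {label: best_match(row, df2_rest) for label, row in df1_rest.iterrows()}
```

Here rows 0 and 1 of `df1_rest` each pair with the row of `df2_rest` that shares three values, while row 2 has no counterpart and maps to `None`, which marks it as a missing row.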
Best of luck.
Slightly tweaking your example data, let's define the following dataframes:
import pandas as pd
import numpy as np
x = [[0, 1, 2, 3],[4, 5, 6, 7],[8, 9, 10, 11],[12, 13, 14, 15]]
y = [[4, 5, 6, 99],[8, 9, np.nan, 11],[12, 13, 14, 15]]
df_ref = pd.DataFrame(x, index=range(4), columns=["a", "b","c","d"])
df = pd.DataFrame(y, index=[1,2,5], columns=["a", "b","c","d"])
df_ref is your "reference" dataframe and df is the dataframe you are comparing it to.
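The differences are: rows 0 and 3 of df_ref have no matching index label in df, row 5 exists only in df, the value in column d of row 1 changed from 7 to 99, and the value in column c of row 2 is NaN instead of 10.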
The following solution highlights new rows in green, deleted rows in red, and changed values in orange:
def get_dataframes_diff(df: pd.DataFrame, df_ref: pd.DataFrame, path_excel=None):
    # Split the row labels into new, deleted, and common rows
    rows_new = df.index[~df.index.isin(df_ref.index)]
    rows_del = df_ref.index[~df_ref.index.isin(df.index)]
    rows_common = df_ref.index.intersection(df.index)

    # Add the deleted rows back so they show up in the diff
    df_diff = pd.concat([df, df_ref.loc[rows_del]]).sort_index()
    s = df_diff.style

    def format_row(row, color: str = "white", bg_color: str = "green"):
        return [f"color: {color}; background-color: {bg_color}"] * len(row)

    # New rows in green, deleted rows in red
    s.apply(format_row, subset=(rows_new, df.columns), color="white", bg_color="green", axis=1)
    s.apply(format_row, subset=(rows_del, df.columns), color="white", bg_color="red", axis=1)

    # Changed cells in common rows in orange (True = same value, False = different)
    mask = pd.DataFrame(True, index=df_diff.index, columns=df_diff.columns)
    mask.loc[rows_common] = (df_ref.loc[rows_common] == df.loc[rows_common])
    mask.replace(True, None, inplace=True)
    mask.replace(False, "color: black; background-color: orange;", inplace=True)
    s.apply(lambda _: mask, axis=None)

    # Optionally export the styled result to Excel
    if path_excel is not None:
        s.to_excel(path_excel)
    return s
Calling it on the example data returns the styled dataframe:
get_dataframes_diff(df, df_ref)
Get the lists of deleted rows, new rows, and rows in common:
rows_new = df.index[~df.index.isin(df_ref.index)]
rows_del = df_ref.index[~df_ref.index.isin(df.index)]
rows_common = df_ref.index.intersection(df.index)
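For reference, the same three label sets can also be computed with `Index.difference`, which reads a little closer to set notation. This is only an equivalent sketch on the example data, not a change to the solution; unlike the boolean-mask version, `difference` returns sorted, de-duplicated labels.

```python
import numpy as np
import pandas as pd

x = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
y = [[4, 5, 6, 99], [8, 9, np.nan, 11], [12, 13, 14, 15]]
df_ref = pd.DataFrame(x, index=range(4), columns=["a", "b", "c", "d"])
df = pd.DataFrame(y, index=[1, 2, 5], columns=["a", "b", "c", "d"])

rows_new = df.index.difference(df_ref.index)       # labels only in df
rows_del = df_ref.index.difference(df.index)       # labels only in df_ref
rows_common = df_ref.index.intersection(df.index)  # labels in both

print(rows_new.tolist())     # [5]
print(rows_del.tolist())     # [0, 3]
print(rows_common.tolist())  # [1, 2]
```

Note that rows 3 and 5 hold the same values but carry different index labels, so a label-based comparison reports one as deleted and the other as new.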
Create a "diff" dataframe by adding the deleted rows to the df dataframe:
df_diff = pd.concat([df, df_ref.loc[rows_del]]).sort_index()
Use Styler.apply to highlight the new rows in green and the deleted rows in red (note the use of the subset argument):
def format_row(row, color: str = "white", bg_color: str = "green"):
    return [f"color: {color}; background-color: {bg_color}"] * len(row)
df_diff.style.apply(format_row, subset = (rows_new, df.columns), color="white", bg_color="green", axis=1)
df_diff.style.apply(format_row, subset = (rows_del, df.columns), color="white", bg_color="red", axis=1)
To highlight value differences for common rows, create a mask dataframe that is True for elements that are the same and False where values differ:
mask = pd.DataFrame(True, index=df_diff.index, columns=df_diff.columns)
mask.loc[rows_common] = (df_ref.loc[rows_common] == df.loc[rows_common])
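One caveat with comparing via `==`: NaN never equals NaN, so a cell that is missing in both dataframes would be flagged as a difference. If that is not what you want, a NaN-aware comparison could look like the sketch below. The helper name `equal_mask` and the small frames `a` and `b` are illustrative and not part of the solution above.

```python
import numpy as np
import pandas as pd

def equal_mask(left, right):
    """Element-wise equality that also treats NaN on both sides as equal."""
    return (left == right) | (left.isna() & right.isna())

a = pd.DataFrame({"a": [1.0, 3.0], "b": [np.nan, 4.0]})
b = pd.DataFrame({"a": [1.0, 3.0], "b": [np.nan, 5.0]})

plain = a == b                # the shared NaN in column b counts as different
nan_aware = equal_mask(a, b)  # the shared NaN in column b counts as equal
```

Both frames must share the same index and columns, which is the case for the `rows_common` slices used above.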
When True (same value), we don't apply any styling. When False, we highlight in orange:
mask.replace(True, None, inplace=True)
mask.replace(False, "color: black; background-color: orange;", inplace=True)
df_diff.style.apply(lambda _: mask, axis=None)
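If you also want a plain, unstyled listing of the changed cells (for example, to cross-check the orange highlights), `DataFrame.compare` can be applied to the common rows. This is a side sketch on the example data rather than part of the solution; it assumes pandas 1.1 or later and requires both inputs to have identical row and column labels.

```python
import numpy as np
import pandas as pd

x = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
y = [[4, 5, 6, 99], [8, 9, np.nan, 11], [12, 13, 14, 15]]
df_ref = pd.DataFrame(x, index=range(4), columns=["a", "b", "c", "d"])
df = pd.DataFrame(y, index=[1, 2, 5], columns=["a", "b", "c", "d"])

rows_common = df_ref.index.intersection(df.index)

# "self" holds the reference value, "other" the value in df;
# only columns and rows with at least one difference are kept
changes = df_ref.loc[rows_common].compare(df.loc[rows_common])
print(changes)
```

Cells that did not change are shown as NaN in the result, so a value that changed to NaN (row 2, column c here) can only be recognized by its non-NaN "self" entry.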
Finally, if you want to save it as an Excel file, provide a valid path to the path_excel argument.