
Python pandas DataFrame comparison

People of Stack Overflow, help!

I have a LeetCode-style problem for you.

Imagine a scenario where you have two 2D arrays, more specifically two pandas DataFrames.

I need to compare these two DataFrames and highlight all the differences, but there is a catch: rows can be missing from either DataFrame, and individual cells can be missing too, which makes this a lot more difficult. I'll provide an example.

import numpy as np
import pandas as pd

x = [[0, 1, 2, 3],[4, 5, 6, 7],[8, 9, 10, 11],[12, 13, 14, 15]]
y = [[np.nan, 1, 2, 3],[4, 5, 6, np.nan],[12, 13, 14, 15]]

df1 = pd.DataFrame(x)
df2 = pd.DataFrame(y)

How can I identify all of the missing cells AND the missing rows?

Bonus points if you can write code that highlights the differences and exports them to an Excel sheet ;)

Stage 1
A good starting point would be the following Stack Overflow post: https://stackoverflow.com/a/48647840/15965988

This would remove the rows that are 100% identical in both tables.
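
One way to carry out this first stage, not necessarily the one used in the linked post, is an outer merge with indicator=True: rows tagged "both" are identical in the two tables and can be dropped. This is a minimal sketch on the question's data. The column names a to d are an assumption added here so that the merge keys are explicit.

```python
import numpy as np
import pandas as pd

cols = ["a", "b", "c", "d"]  # assumed names; the question uses default integer labels
df1 = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]], columns=cols)
df2 = pd.DataFrame([[np.nan, 1, 2, 3], [4, 5, 6, np.nan], [12, 13, 14, 15]], columns=cols)

# Outer merge on all columns; the "_merge" column records where each row was found.
merged = df1.merge(df2, how="outer", on=cols, indicator=True)

# Keep only the rows that are not present, cell for cell, in both tables.
only_df1 = merged.loc[merged["_merge"] == "left_only", cols]
only_df2 = merged.loc[merged["_merge"] == "right_only", cols]

print(only_df1)
print(only_df2)
```

Only the row [12, 13, 14, 15] is dropped here. A row with a single missing cell still counts as different, which is why a second stage is needed.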

Stage 2
At this stage, only rows with differences remain. From here I'd recommend looping over each row. For each row, you'll need some logic that queries the other dataframe for a similar row. For that query, consider matching on only a subset of the columns.
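
A rough sketch of that loop follows. The frames left and right stand for the rows remaining after stage 1 on the question's data, and key_cols is a hypothetical choice of columns assumed to be reliable enough to identify a row; in practice that choice depends on the data.

```python
import numpy as np
import pandas as pd

cols = ["a", "b", "c", "d"]  # assumed column names
left = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]], columns=cols)
right = pd.DataFrame([[np.nan, 1, 2, 3], [4, 5, 6, np.nan]], columns=cols)

key_cols = ["b", "c"]  # hypothetical subset of columns used to pair up rows

matches = {}
for idx, row in right.iterrows():
    # Rows of `left` whose key columns all equal this row's key columns.
    is_match = left[key_cols].eq(row[key_cols]).all(axis=1)
    matches[idx] = left.index[is_match].tolist()

# Rows of `left` that were never matched are candidates for "missing rows".
matched = {i for hits in matches.values() for i in hits}
unmatched_left = [i for i in left.index if i not in matched]

print(matches)
print(unmatched_left)
```

Once two rows are paired, the remaining columns can be compared cell by cell to find the missing or changed values.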

Best of luck.

example dataset

Slightly tweaking your example data, let's define the following dataframes:

import pandas as pd
import numpy as np
x = [[0, 1, 2, 3],[4, 5, 6, 7],[8, 9, 10, 11],[12, 13, 14, 15]]
y = [[4, 5, 6, 99],[8, 9, np.nan, 11],[12, 13, 14, 15]]

df_ref = pd.DataFrame(x, index=range(4), columns=["a", "b","c","d"])
df = pd.DataFrame(y, index=[1,2,5], columns=["a", "b","c","d"])

df_ref is your "reference" dataframe, and df is the dataframe you are comparing it to.

The differences are:

  • rows 0 and 3 are missing
  • there is a new row (5)
  • (1, "d") is equal to 99 instead of 7
  • (2, "c") is NaN instead of 10
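
As a cross-check, the differences can be recomputed directly from df and df_ref. This is only a sketch and not part of the solution that follows: it uses Index.difference for the row labels and DataFrame.compare (available from pandas 1.1) for the cells of the rows that both frames share.

```python
import numpy as np
import pandas as pd

x = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
y = [[4, 5, 6, 99], [8, 9, np.nan, 11], [12, 13, 14, 15]]
df_ref = pd.DataFrame(x, index=range(4), columns=["a", "b", "c", "d"])
df = pd.DataFrame(y, index=[1, 2, 5], columns=["a", "b", "c", "d"])

rows_missing = df_ref.index.difference(df.index)  # labels only in df_ref
rows_new = df.index.difference(df_ref.index)      # labels only in df
rows_common = df_ref.index.intersection(df.index)

# compare() keeps only the cells that differ: "self" is the value in df_ref,
# "other" is the value in df.
cell_diff = df_ref.loc[rows_common].compare(df.loc[rows_common])

print(list(rows_missing), list(rows_new))
print(cell_diff)
```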

solution

The following solution highlights:

  • [in red] the "deleted rows" (row indexes that don't appear in df)
  • [in green] the "new rows" (row indexes that don't appear in df_ref)
  • [in orange] the values that differ in the rows common to both

def get_dataframes_diff(df: pd.DataFrame, df_ref: pd.DataFrame, path_excel=None):
    # Row labels only in df (new), only in df_ref (deleted), and in both (common).
    rows_new = df.index[~df.index.isin(df_ref.index)]
    rows_del = df_ref.index[~df_ref.index.isin(df.index)]
    rows_common = df_ref.index.intersection(df.index)

    # Add the deleted rows back so that they can be displayed and highlighted.
    df_diff = pd.concat([df, df_ref.loc[rows_del]]).sort_index()
    s = df_diff.style

    def format_row(row, color: str = "white", bg_color: str = "green"):
        return [f"color: {color}; background-color: {bg_color}"] * len(row)

    # Whole-row highlighting: new rows in green, deleted rows in red.
    s.apply(format_row, subset=(rows_new, df.columns), color="white", bg_color="green", axis=1)
    s.apply(format_row, subset=(rows_del, df.columns), color="white", bg_color="red", axis=1)

    # True where a cell is unchanged; only the common rows can contain False.
    mask = pd.DataFrame(True, index=df_diff.index, columns=df_diff.columns)
    mask.loc[rows_common] = (df_ref.loc[rows_common] == df.loc[rows_common])
    # Unchanged cells get no style (None); changed cells are highlighted in orange.
    mask.replace(True, None, inplace=True)
    mask.replace(False, "color: black; background-color: orange;", inplace=True)

    # With axis=None the function receives the whole frame and returns a frame of CSS strings.
    s.apply(lambda _: mask, axis=None)

    # Optionally export the styled table to an Excel file.
    if path_excel is not None:
        s.to_excel(path_excel)
    return s

Call it on the example data to get the styled table:

get_dataframes_diff(df, df_ref)

explanation

Get the list of deleted rows, new rows, and rows in common:

rows_new = df.index[~df.index.isin(df_ref.index)]
rows_del = df_ref.index[~df_ref.index.isin(df.index)]
rows_common = df_ref.index.intersection(df.index)

Create a "diff" dataframe by adding the deleted rows to the df dataframe:

df_diff = pd.concat([df, df_ref.loc[rows_del]]).sort_index()

Use Styler.apply to highlight the new rows in green and the deleted rows in red (note the use of the subset argument):

def format_row(row, color: str = "white", bg_color: str = "green"):
    return [f"color: {color}; background-color: {bg_color}"] * len(row)

s = df_diff.style  # keep a single Styler so that successive apply calls accumulate
s.apply(format_row, subset=(rows_new, df.columns), color="white", bg_color="green", axis=1)
s.apply(format_row, subset=(rows_del, df.columns), color="white", bg_color="red", axis=1)

To highlight value differences in the common rows, create a mask dataframe that is True for elements that are the same and False where the values differ:

mask = pd.DataFrame(True, index=df_diff.index, columns=df_diff.columns)
mask.loc[rows_common] = (df_ref.loc[rows_common] == df.loc[rows_common])
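
One caveat with the == comparison: NaN never compares equal to NaN, so a cell that is missing in both dataframes would be flagged as a difference. If that is not wanted, a NaN-aware comparison is a small change. The sketch below uses two tiny hypothetical frames (ref and new are not from the answer) to show the difference between the two behaviours.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: cell (2, "c") is NaN in both, cell (1, "d") really differs.
ref = pd.DataFrame({"c": [10.0, np.nan], "d": [7, 11]}, index=[1, 2])
new = pd.DataFrame({"c": [10.0, np.nan], "d": [99, 11]}, index=[1, 2])

plain = ref == new                             # NaN == NaN evaluates to False
nan_aware = plain | (ref.isna() & new.isna())  # treat NaN in both frames as equal

print(plain)
print(nan_aware)
```

Assigning the nan_aware result to mask.loc[rows_common] instead of the plain comparison would leave cells that are missing on both sides unhighlighted.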

When True (same value), we don't apply any styling. When False, we highlight in orange:

mask.replace(True, None, inplace=True)
mask.replace(False, "color: black; background-color: orange;", inplace=True)

s.apply(lambda _: mask, axis=None)  # s is the Styler obtained from df_diff.style
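
The two replace calls turn a boolean frame into a frame of CSS strings. An alternative sketch, which avoids replacing True with None (something that has behaved differently across pandas versions), builds the CSS frame with numpy.where. The names same, ORANGE and css are illustrative only, and the mask is filled in by hand to mimic the example.

```python
import numpy as np
import pandas as pd

# Hypothetical "same value" mask with the shape of the example's diff table.
same = pd.DataFrame(True, index=[0, 1, 2, 3, 5], columns=["a", "b", "c", "d"])
same.loc[1, "d"] = False
same.loc[2, "c"] = False

ORANGE = "color: black; background-color: orange;"

# np.where picks "" (no styling) where the mask is True and the orange rule elsewhere.
css = pd.DataFrame(np.where(same, "", ORANGE), index=same.index, columns=same.columns)

print(css)
```

The resulting frame can be handed to Styler.apply with axis=None in the same way as mask, since an empty string means that no style is applied to that cell.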

Finally, if you want to save the result as an Excel file, pass a valid file path as the path_excel argument. Note that writing Excel files from pandas requires an Excel writer engine (for example openpyxl) to be installed.
