Comparing two Pandas DataFrames natively

Question

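I want to compare two very similar DataFrames: one is loaded from a JSON file and resampled, the other is loaded from a CSV file produced by a more complex use case.

These are the first values of df1: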

                           page
logging_time                   
2021-07-04 18:14:47.000   748.0
2021-07-04 18:14:47.100     0.0
2021-07-04 18:14:47.200     0.0
2021-07-04 18:14:47.300     3.0
2021-07-04 18:14:47.400     4.0
[5 rows x 1 columns]

These are the first values of df2:

   @timestamp per 100 milliseconds  Sum of page
0          2021-04-07 18:14:47.000        748.0
1          2021-04-07 18:14:47.100          0.0
2          2021-04-07 18:14:47.200          0.0
3          2021-04-07 18:14:47.300          3.0
4          2021-04-07 18:14:47.400          4.0
[5 rows x 2 columns]

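I am comparing them with pandas.testing.assert_frame_equal and trying to customize the data so that they compare equal; I would appreciate some help with that. The first column should be dropped, and the label names should be ignored.

I want to do this in the most pandas-native way, rather than comparing only the values.

Any help would be appreciated.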

Answer 1

You can use the equals function to compare DataFrames. The catch is that the column names must match:

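import pandas as pd

data = [
    ["2021-07-04 18:14:47.000", 748.0],
    ["2021-07-04 18:14:47.100",   0.0],
    ["2021-07-04 18:14:47.200",   0.0],
    ["2021-07-04 18:14:47.300",   3.0],
    ["2021-07-04 18:14:47.400",   4.0],
]

# df1 uses logging_time as its index, as in the question
df1 = pd.DataFrame(data, columns=["logging_time", "page"])
df1.set_index("logging_time", inplace=True)

# df2 is built from the same rows, with a default integer index
# and column names that match df1
df2 = pd.DataFrame(data, columns=["logging_time", "page"])

# reset_index() turns df1's index back into a column so both frames line up
print(df1.reset_index().equals(df2))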

Output:

True
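
As a small illustration of why the column names (and dtypes) matter for equals, here is a minimal sketch. The three tiny frames a, b and c are made up for this example; only the labels "page" and "Sum of page" come from the question.

```python
import pandas as pd

# Hypothetical toy frames; only the column labels are taken from the question
a = pd.DataFrame({"page": [748.0, 0.0, 3.0]})
b = pd.DataFrame({"Sum of page": [748.0, 0.0, 3.0]})
c = pd.DataFrame({"page": [748, 0, 3]})  # same values, but integer dtype

# Same values, different column label
names_differ = a.equals(b)

# After renaming the column, the frames hold the same labels, values and dtypes
after_rename = a.equals(b.rename(columns={"Sum of page": "page"}))

# Same label, but float64 vs int64 columns
dtypes_differ = a.equals(c)

print(names_differ, after_rename, dtypes_differ)
```

Because equals only returns a boolean, it does not say where two frames differ; that is where assert_frame_equal tends to be more helpful.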

Answer 2

import pandas as pd
from pandas.testing import assert_frame_equal

The DataFrames I used:

df1=pd.DataFrame({'page': {'2021-07-04 18:14:47.000': 748.0,
  '2021-07-04 18:14:47.100': 0.0,
  '2021-07-04 18:14:47.200': 0.0,
  '2021-07-04 18:14:47.300': 3.0,
  '2021-07-04 18:14:47.400': 4.0}})
df1.index.names=['logging_time']

df2=pd.DataFrame({'@timestamp per 100 milliseconds': {0: '2021-07-04 18:14:47.000',
  1: '2021-07-04 18:14:47.100',
  2: '2021-07-04 18:14:47.200',
  3: '2021-07-04 18:14:47.300',
  4: '2021-07-04 18:14:47.400'},
 'Sum of page': {0: 748.0, 1: 0.0, 2: 0.0, 3: 3.0, 4: 4.0}})

Solution:

# Reset the index of df1 so that logging_time becomes a regular column
df1 = df1.reset_index()
# Rename the columns of df2 so that they match those of df1
df2.columns = df1.columns
# True means every column has the same dtype in both frames.
# If this prints False, inspect print(df1.dtypes == df2.dtypes)
# and convert the dtypes of either df1 or df2 accordingly.
print((df1.dtypes == df2.dtypes).all())
# Finally: assert_frame_equal returns None when the frames are equal,
# so this prints None; otherwise it raises an AssertionError.
print(assert_frame_equal(df1, df2))
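
If converting dtypes by hand is not desirable, assert_frame_equal can also be told to skip the dtype check. The following is a rough sketch of that idea; the helper name frames_match and the toy frames are hypothetical and not part of the answer above.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(left, right, **kwargs):
    """Hypothetical helper: True if assert_frame_equal passes, else False."""
    try:
        assert_frame_equal(left, right, **kwargs)
    except AssertionError as err:
        # The message describes the first mismatch that was found
        print(err)
        return False
    return True


# Toy data: same values, float64 on the left and int64 on the right
left = pd.DataFrame({"page": [748.0, 0.0, 3.0]})
right = pd.DataFrame({"page": [748, 0, 3]})

strict = frames_match(left, right)                      # dtype is checked
relaxed = frames_match(left, right, check_dtype=False)  # dtype is ignored

print(strict, relaxed)
```

Wrapping the assertion this way keeps the descriptive error message while letting a script continue after a mismatch.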

Answer 3

Thanks for your answer.

However, df2.columns = df1.columns fails with the following error: ValueError: Length mismatch: Expected axis has 3 elements, new values have 1 elements

Printing the columns gives:

print(df2.columns)
print(df1.columns)


Index(['index', '@timestamp per 100 milliseconds', 'Sum of page'], dtype='object')
Index(['page'], dtype='object')

And no change to the columns seems possible, so how can I compare them?

Thank you very much for your help!
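
One possible way to line the two frames up, given the column layout printed above, is sketched below. It assumes that the extra "index" column in df2 can be discarded, that the timestamp column should become the index, and that both frames hold the same timestamps; the sample rows are shortened from the question.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

timestamps = [
    "2021-07-04 18:14:47.000",
    "2021-07-04 18:14:47.100",
    "2021-07-04 18:14:47.200",
]
values = [748.0, 0.0, 3.0]

# df1: a single 'page' column indexed by a datetime index named 'logging_time'
df1 = pd.DataFrame(
    {"page": values},
    index=pd.to_datetime(timestamps).rename("logging_time"),
)

# df2: three columns, matching the layout printed above
ts_col = "@timestamp per 100 milliseconds"
df2 = pd.DataFrame({"index": range(3), ts_col: timestamps, "Sum of page": values})

# Assumption: the leftover 'index' column carries no information and can go
aligned = df2.drop(columns="index")
aligned[ts_col] = pd.to_datetime(aligned[ts_col])
aligned = aligned.set_index(ts_col)

# Both frames now have exactly one column, so the labels can be copied over
aligned.columns = df1.columns

# check_names=False ignores the differing index names
assert_frame_equal(df1, aligned, check_names=False)
print("frames are equal")
```

Note that check_names only covers the names attribute of the index and columns, not the column labels themselves, which is why the labels are still copied from df1.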

Answer 4

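This is a lot of code, but it is a nearly comprehensive comparison of two DataFrames, given the join keys and the columns to ignore. Its current weakness is that it does not compare or analyze values that may be missing from either dataset.

Also note that this script writes out .csv files of the rows that differ for the specified join keys, containing only that column's values from both datasets. (Comment out that part if you do not want these files written.)

If you prefer the look of a Jupyter notebook, here is a git link: https://github.com/marckeelingiv/MyPyNotebooks/blob/master/Test-Prod%20Compare.ipynb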

#  Imports
import pandas as pd

#  Set Target Data Sets
test_csv_location = 'test.csv'
prod_csv_location = 'prod.csv'

#  Set what columns to join on and what columns to remove
join_columns = ['ORIGINAL_IID','CLAIM_IID','CLAIM_LINE','EDIT_MNEMONIC']
columns_to_remove = ['Original Clean']

#  Peek at the data to get a list of the column names
test_df = pd.read_csv(test_csv_location,nrows=10)
prod_df = pd.read_csv(prod_csv_location,nrows=10)

#  Create a dictionary to set all columns to strings
all_columns = set()
for c in test_df.columns.values:
    all_columns.add(c)
for c in prod_df.columns.values:
    all_columns.add(c)

dtypes = {}
for c in all_columns:
    dtypes[f'{c}']=str

#  Perform full import setting data types and specifying index
test_df = pd.read_csv(test_csv_location,dtype=dtypes,index_col=join_columns)
prod_df = pd.read_csv(prod_csv_location,dtype=dtypes,index_col=join_columns)

#  Drop desired columns (columns that are absent are ignored)
test_df = test_df.drop(columns=columns_to_remove, errors='ignore')
prod_df = prod_df.drop(columns=columns_to_remove, errors='ignore')

#  Join Data Frames to prepare for comparing
compare_df = test_df.join(
    prod_df,
    how='outer',
    lsuffix='_test',rsuffix='_prod'
).fillna('')

#  Create list of columns to compare
columns_to_compare = []
for c in all_columns:
    if c not in columns_to_remove and c not in join_columns:
        columns_to_compare.append(c)

#  Show the difference in columns for each data set
list_of_different_columns = []
for column in columns_to_compare:
    are_different = ~(compare_df[f'{column}_test']==compare_df[f'{column}_prod'])
    differences = sum(are_different)
    test_not_nulls = ~(compare_df[f'{column}_test']=='')
    prod_not_nulls = ~(compare_df[f'{column}_prod']=='')
    temp_df = compare_df[are_different & test_not_nulls & prod_not_nulls]
    if len(temp_df)>0:
        print(f'{differences} differences in {column}')
        print(f'\t{(~test_not_nulls).sum()} Nulls in Test')
        print(f'\t{(~prod_not_nulls).sum()} Nulls in Prod')
        to_file = temp_df[[f'{column}_test',f'{column}_prod']].copy()
        to_file.to_csv(path_or_buf=f'{column}_Test.csv')
        list_of_different_columns.append(column)
        del to_file
    del temp_df,prod_not_nulls,test_not_nulls,differences,are_different

#  Functions to show/analyze differences

def return_detla_df(column):
    mask = ~(compare_df[f'{column}_test']==compare_df[f'{column}_prod'])
    mask2 = ~(compare_df[f'{column}_test']=='')
    mask3 = ~(compare_df[f'{column}_prod']=='')
    df = compare_df[mask & mask2 & mask3][[f'{column}_test',f'{column}_prod']].copy()
    try:
        # Only numeric columns get a Delta; non-numeric values are left as-is
        df['Delta'] = df[f'{column}_prod'].astype(float)-df[f'{column}_test'].astype(float)
        df.sort_values(by='Delta',ascending=False,inplace=True)
    except (ValueError, TypeError):
        pass
    return df

def show_count_of_differnces(column):
    df = return_detla_df(column)
    return pd.DataFrame(
        df.groupby(by=[f'{column}_test',f'{column}_prod']).size(),
        columns=['Count']
    ).sort_values('Count',ascending=False).copy()


#  Code to run to see differences
#  Copy the resulting code into individual Jupyter notebook cells to dig into the differences
for c in list_of_different_columns:
    print(f"## {c}")
    print(f"return_detla_df('{c}')")
    print(f"show_count_of_differnces('{c}')")
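
The weakness mentioned at the start of this answer (rows that exist in only one dataset) could be surfaced with an outer merge and its indicator column. This is only a sketch under assumptions: the toy rows and the AMOUNT column are hypothetical, and a single join key, CLAIM_IID, stands in for the full join_columns list.

```python
import pandas as pd

# Hypothetical toy data; CLAIM_IID is one of the join keys used above
test_df = pd.DataFrame({"CLAIM_IID": ["1", "2", "3"], "AMOUNT": ["10", "20", "30"]})
prod_df = pd.DataFrame({"CLAIM_IID": ["2", "3", "4"], "AMOUNT": ["20", "35", "40"]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
merged = test_df.merge(
    prod_df,
    on="CLAIM_IID",
    how="outer",
    suffixes=("_test", "_prod"),
    indicator=True,
)

only_in_test = merged[merged["_merge"] == "left_only"]
only_in_prod = merged[merged["_merge"] == "right_only"]
in_both = merged[merged["_merge"] == "both"]

print(only_in_test["CLAIM_IID"].tolist())
print(only_in_prod["CLAIM_IID"].tolist())
```

Rows tagged 'both' can then go through the column-by-column comparison, while the one-sided rows are reported separately.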

4 solutions

Solution 1: 0 2021-07-10 15:48:44

Solution 2: 0 2021-07-10 15:54:46

Solution 3: 0 2021-07-11 18:24:17

Solution 4: 0 2021-07-19 16:16:44
