
Natively compare two Pandas Dataframes

I want to compare two very similar DataFrames: one is loaded from a JSON file and resampled, the second is loaded from a CSV file produced by a more complicated use case.

These are the first values of df1:

                           page
logging_time                   
2021-07-04 18:14:47.000   748.0
2021-07-04 18:14:47.100     0.0
2021-07-04 18:14:47.200     0.0
2021-07-04 18:14:47.300     3.0
2021-07-04 18:14:47.400     4.0
[5 rows x 1 columns]

And these are the first values of df2:

   @timestamp per 100 milliseconds  Sum of page
0          2021-04-07 18:14:47.000        748.0
1          2021-04-07 18:14:47.100          0.0
2          2021-04-07 18:14:47.200          0.0
3          2021-04-07 18:14:47.300          3.0
4          2021-04-07 18:14:47.400          4.0
[5 rows x 2 columns]

I'm comparing them with pandas.testing.assert_frame_equal and trying to apply some customizations so that the data compares as equal; I would like some help with that. The first column should be removed and the label names should be ignored.

I want to do that in the most pandas-native way, and not compare only the values.

Any help would be appreciated
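For reference, a minimal sketch of the kind of comparison I'm after, assuming df2's timestamp column is turned into the index first and assuming a recent pandas where assert_frame_equal accepts check_names and check_freq:

import pandas as pd
from pandas.testing import assert_frame_equal

# Sketch: make df2 look like df1 (DatetimeIndex + one value column),
# then compare while ignoring the differing label names.
df2_aligned = df2.set_index('@timestamp per 100 milliseconds')
df2_aligned.index = pd.to_datetime(df2_aligned.index)
df2_aligned.columns = df1.columns          # 'Sum of page' -> 'page'

assert_frame_equal(
    df1, df2_aligned,
    check_names=False,   # ignore the index/column name mismatch
    check_freq=False,    # df1's resampled index may carry a freq, df2's won't
)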

You can use the equals function to compare the dataframes. The catch is that column names must match:

import pandas as pd

data = [
    ["2021-07-04 18:14:47.000", 748.0],
    ["2021-07-04 18:14:47.100",   0.0],
    ["2021-07-04 18:14:47.200",   0.0],
    ["2021-07-04 18:14:47.300",   3.0],
    ["2021-07-04 18:14:47.400",   4.0],
]

df1 = pd.DataFrame(data, columns = ["logging_time", "page"])
df1.set_index("logging_time", inplace=True)

df2 = pd.DataFrame(data, columns = ["logging_time", "page"])

print(df1.reset_index().equals(df2))

Output:

True
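Applied to the frames from the question, the same idea would look roughly like this (a sketch; it assumes the timestamp strings parse cleanly with pd.to_datetime):

# Rename df2's columns to match df1, parse the timestamps, then compare
# value-for-value with equals().
df2_renamed = df2.rename(columns={
    '@timestamp per 100 milliseconds': 'logging_time',
    'Sum of page': 'page',
})
df2_renamed['logging_time'] = pd.to_datetime(df2_renamed['logging_time'])

print(df1.reset_index().equals(df2_renamed))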

from pandas.testing import assert_frame_equal

The DataFrames I used:

df1=pd.DataFrame({'page': {'2021-07-04 18:14:47.000': 748.0,
  '2021-07-04 18:14:47.100': 0.0,
  '2021-07-04 18:14:47.200': 0.0,
  '2021-07-04 18:14:47.300': 3.0,
  '2021-07-04 18:14:47.400': 4.0}})
df1.index.names=['logging_time']

df2=pd.DataFrame({'@timestamp per 100 milliseconds': {0: '2021-07-04 18:14:47.000',
  1: '2021-07-04 18:14:47.100',
  2: '2021-07-04 18:14:47.200',
  3: '2021-07-04 18:14:47.300',
  4: '2021-07-04 18:14:47.400'},
 'Sum of page': {0: 748.0, 1: 0.0, 2: 0.0, 3: 3.0, 4: 4.0}})

Solution:

df1=df1.reset_index()
#resetting the index of df1
df2.columns=df1.columns
#renaming the columns of df2 so that they become the same as df1's
print((df1.dtypes==df2.dtypes).all())
#If the above code returns True, the dtypes are the same
#If it returns False, check the output of: print(df1.dtypes==df2.dtypes)
#and change the dtypes of either df (df1 or df2) accordingly
#Finally:
print(assert_frame_equal(df1,df2))
#If the above prints None, the frames are equal;
#otherwise it will throw an AssertionError

Thanks for your answer

But df2.columns=df1.columns fails with this error: ValueError: Length mismatch: Expected axis has 3 elements, new values have 1 elements

Printing those columns gives:

print(df2.columns)
print(df1.columns)


Index(['index', '@timestamp per 100 milliseconds', 'Sum of page'], dtype='object')
Index(['page'], dtype='object')

No change to the columns worked; how can I compare them?

Thanks very much for the help!
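Judging from the column listings above, the mismatch comes from the extra 'index' column in df2. One possible workaround (a sketch, assuming that 'index' is a leftover from an earlier reset_index and can be dropped):

import pandas as pd
from pandas.testing import assert_frame_equal

# Keep only the two columns that correspond to df1's index and value column,
# rename them, and parse the timestamps before comparing.
df2_clean = (
    df2.drop(columns=['index'])
       .rename(columns={'@timestamp per 100 milliseconds': 'logging_time',
                        'Sum of page': 'page'})
)
df2_clean['logging_time'] = pd.to_datetime(df2_clean['logging_time'])

assert_frame_equal(df1.reset_index(), df2_clean)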

This is a lot of code, but it is an almost comprehensive comparison of two data frames given a join key and column(s) to ignore. Its current weakness is that it does not compare/analyze rows that exist in only one of the data sets.

Also, please note that this script writes out .csv files of the differing rows, keyed by the specified join key and containing only the column values from the two data sets (comment out that portion if you don't want those files written).

Here is a GitHub link if you prefer the Jupyter notebook view: https://github.com/marckeelingiv/MyPyNotebooks/blob/master/Test-Prod%20Compare.ipynb

#  Imports
import pandas as pd

#  Set Target Data Sets
test_csv_location = 'test.csv'
prod_csv_location = 'prod.csv'

#  Set which columns to join on and which columns to remove
join_columns = ['ORIGINAL_IID','CLAIM_IID','CLAIM_LINE','EDIT_MNEMONIC']
columns_to_remove = ['Original Clean']

#  Peek at the data to get a list of the column names
test_df = pd.read_csv(test_csv_location,nrows=10)
prod_df = pd.read_csv(prod_csv_location,nrows=10)

#  Create a dictionary that maps every column to str so all values are read as strings
all_columns = set()
for c in test_df.columns.values:
    all_columns.add(c)
for c in prod_df.columns.values:
    all_columns.add(c)

dtypes = {}
for c in all_columns:
    dtypes[f'{c}']=str

#  Perform the full import, setting data types and specifying the index
test_df = pd.read_csv(test_csv_location,dtype=dtypes,index_col=join_columns)
prod_df = pd.read_csv(prod_csv_location,dtype=dtypes,index_col=join_columns)

#  Drop the columns we want to ignore (skip any that are missing)
for c in columns_to_remove:
    try:
        del test_df[f'{c}']
    except KeyError:
        pass
    try:
        del prod_df[f'{c}']
    except KeyError:
        pass

#  Join Data Frames to prepare for comparing
compare_df = test_df.join(
    prod_df,
    how='outer',
    lsuffix='_test',rsuffix='_prod'
).fillna('')

#  Create list of columns to compare
columns_to_compare = []
for c in all_columns:
    if c not in columns_to_remove and c not in join_columns:
        columns_to_compare.append(c)

#  Show the difference in columns for each data set
list_of_different_columns = []
for column in columns_to_compare:
    are_different = ~(compare_df[f'{column}_test']==compare_df[f'{column}_prod'])
    differences = sum(are_different)
    test_not_nulls = ~(compare_df[f'{column}_test']=='')
    prod_not_nulls = ~(compare_df[f'{column}_prod']=='')
    temp_df = compare_df[are_different & test_not_nulls & prod_not_nulls]
    if len(temp_df)>0:
        print(f'{differences} differences in {column}')
        print(f'\t{(~test_not_nulls).sum()} Nulls in Test')
        print(f'\t{(~prod_not_nulls).sum()} Nulls in Prod')
        to_file = temp_df[[f'{column}_test',f'{column}_prod']].copy()
        to_file.to_csv(path_or_buf=f'{column}_Test.csv')
        list_of_different_columns.append(column)
        del to_file
    del temp_df,prod_not_nulls,test_not_nulls,differences,are_different

#  Functions to show/analyze differences

def return_delta_df(column):
    mask = ~(compare_df[f'{column}_test']==compare_df[f'{column}_prod'])
    mask2 = ~(compare_df[f'{column}_test']=='')
    mask3 = ~(compare_df[f'{column}_prod']=='')
    df = compare_df[mask & mask2 & mask3][[f'{column}_test',f'{column}_prod']].copy()
    try:
        df['Delta'] = df[f'{column}_prod'].astype(float)-df[f'{column}_test'].astype(float)
        df.sort_values(by='Delta',ascending=False,inplace=True)
    except (ValueError, TypeError):
        # Non-numeric columns cannot be cast to float; leave them unsorted
        pass
    return df

def show_count_of_differences(column):
    df = return_delta_df(column)
    return pd.DataFrame(
        df.groupby(by=[f'{column}_test',f'{column}_prod']).size(),
        columns=['Count']
    ).sort_values('Count',ascending=False).copy()


# ### Code to run to see the differences
# Copy the resulting output into individual Jupyter notebook cells to dig into the differences
for c in list_of_different_columns:
    print(f"## {c}")
    print(f"return_delta_df('{c}')")
    print(f"show_count_of_differences('{c}')")
