Natively compare two Pandas Dataframes
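I want to compare two very similar DataFrames: one is loaded from a JSON file and resampled, the other is loaded from a CSV file in a more complex use case.
These are the first values of df1: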
                          page
logging_time
2021-07-04 18:14:47.000  748.0
2021-07-04 18:14:47.100    0.0
2021-07-04 18:14:47.200    0.0
2021-07-04 18:14:47.300    3.0
2021-07-04 18:14:47.400    4.0
[5 rows x 1 columns]
These are the first values of df2:
  @timestamp per 100 milliseconds  Sum of page
0         2021-04-07 18:14:47.000        748.0
1         2021-04-07 18:14:47.100          0.0
2         2021-04-07 18:14:47.200          0.0
3         2021-04-07 18:14:47.300          3.0
4         2021-04-07 18:14:47.400          4.0
[5 rows x 2 columns]
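I am comparing them with pandas.testing.assert_frame_equal and have tried some customizations of the data to make them equal; I would appreciate help with that. The first column should be dropped, and the label names should be ignored.
I would like to do this in the most pandas-native way, rather than comparing only the values.
Any help would be appreciated.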
You can use the equals function to compare DataFrames. The catch is that the column names must match:
import pandas as pd

data = [
    ["2021-07-04 18:14:47.000", 748.0],
    ["2021-07-04 18:14:47.100", 0.0],
    ["2021-07-04 18:14:47.200", 0.0],
    ["2021-07-04 18:14:47.300", 3.0],
    ["2021-07-04 18:14:47.400", 4.0],
]
df1 = pd.DataFrame(data, columns=["logging_time", "page"])
df1.set_index("logging_time", inplace=True)
# df2 holds the same values under the column names from the question
df2 = pd.DataFrame(data, columns=["@timestamp per 100 milliseconds", "Sum of page"])
# equals() also compares column labels, so reuse the labels of df1
df2.columns = df1.reset_index().columns
print(df1.reset_index().equals(df2))
Output:
True
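equals only reports True or False. When it returns False, DataFrame.compare (available from pandas 1.1) can show which cells differ. The following is a minimal sketch, assuming both frames already share the same index and column labels; the frames `left` and `right` and the changed value are made up for illustration.

```python
import pandas as pd

# Two illustrative frames with identical labels (hypothetical data)
left = pd.DataFrame(
    {
        "logging_time": ["2021-07-04 18:14:47.000", "2021-07-04 18:14:47.100"],
        "page": [748.0, 0.0],
    }
)
right = left.copy()
right.loc[1, "page"] = 5.0  # introduce a single difference

# equals() only says whether the frames are identical
print(left.equals(right))

# compare() keeps only the differing cells, side by side as "self" and "other"
diff = left.compare(right)
print(diff)
```

An empty result from compare means no cell differs. compare raises a ValueError when the labels of the two frames are not identical, so the renaming step is still needed first.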
import pandas as pd
from pandas.testing import assert_frame_equal
The DataFrames I used:
df1=pd.DataFrame({'page': {'2021-07-04 18:14:47.000': 748.0,
'2021-07-04 18:14:47.100': 0.0,
'2021-07-04 18:14:47.200': 0.0,
'2021-07-04 18:14:47.300': 3.0,
'2021-07-04 18:14:47.400': 4.0}})
df1.index.names=['logging_time']
df2=pd.DataFrame({'@timestamp per 100 milliseconds': {0: '2021-07-04 18:14:47.000',
1: '2021-07-04 18:14:47.100',
2: '2021-07-04 18:14:47.200',
3: '2021-07-04 18:14:47.300',
4: '2021-07-04 18:14:47.400'},
'Sum of page': {0: 748.0, 1: 0.0, 2: 0.0, 3: 3.0, 4: 4.0}})
Solution:
# Reset the index of df1 so that logging_time becomes a regular column
df1 = df1.reset_index()
# Rename the columns of df2 so that they match those of df1
df2.columns = df1.columns
# True means that the column dtypes match
print((df1.dtypes == df2.dtypes).all())
# If it prints False, check the output of print(df1.dtypes == df2.dtypes)
# and change the dtypes of either df1 or df2 accordingly
# Finally:
print(assert_frame_equal(df1, df2))
# If this prints None, the frames are equal;
# otherwise assert_frame_equal raises an AssertionError
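The dtype check described above can be followed by a cast. Here is a minimal sketch, assuming the only difference is that one frame parsed "page" as integers; the frames `a` and `b` are made up for illustration.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

times = ["2021-07-04 18:14:47.000", "2021-07-04 18:14:47.100"]
a = pd.DataFrame({"logging_time": times, "page": [748.0, 0.0]})
# Same values, but "page" holds integers here (hypothetical mismatch)
b = pd.DataFrame({"logging_time": times, "page": [748, 0]})

# Column-by-column dtype comparison: shows which columns need a cast
same_dtype = a.dtypes == b.dtypes
print(same_dtype)

# Option 1: cast b to the dtypes of a, then compare strictly
b_cast = b.astype(a.dtypes.to_dict())
assert_frame_equal(a, b_cast)

# Option 2: leave the data untouched and relax the dtype check
assert_frame_equal(a, b, check_dtype=False)
```

assert_frame_equal returns None on success, so wrapping it in print is optional; the absence of an AssertionError is the result.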
Thanks for your answer, but df2.columns=df1.columns fails with the following error: ValueError: Length mismatch: Expected axis has 3 elements, new values have 1 elements
Printing the columns gives:
print(df2.columns)
print(df1.columns)
Index(['index', '@timestamp per 100 milliseconds', 'Sum of page'], dtype='object')
Index(['page'], dtype='object')
and no change to the columns seems to be possible. How can I compare them?
Thanks a lot for your help!
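The printed columns suggest that the two frames have different shapes: df2 carries an extra 'index' column and the timestamp as a regular column, while df1 keeps the timestamp in its index. The following is a minimal sketch of bringing df2 into the shape of df1 before comparing, assuming the 'index' column is a leftover that can be dropped and that both sides store the timestamps as strings; the sample frames are rebuilt here for illustration.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

times = [
    "2021-07-04 18:14:47.000",
    "2021-07-04 18:14:47.100",
    "2021-07-04 18:14:47.200",
    "2021-07-04 18:14:47.300",
    "2021-07-04 18:14:47.400",
]
pages = [748.0, 0.0, 0.0, 3.0, 4.0]

# df1: timestamps in the index, a single "page" column
df1 = pd.DataFrame({"page": pages}, index=pd.Index(times, name="logging_time"))
# df2: an extra "index" column plus the two CSV columns
df2 = pd.DataFrame(
    {
        "index": list(range(5)),
        "@timestamp per 100 milliseconds": times,
        "Sum of page": pages,
    }
)

aligned = (
    df2.drop(columns=["index"])                      # drop the leftover column
       .set_index("@timestamp per 100 milliseconds")  # timestamps become the index
       .set_axis(df1.columns, axis=1)                # reuse the column labels of df1
)

# check_names=False ignores the differing index name
assert_frame_equal(df1, aligned, check_names=False)
```

If one side holds real datetimes and the other strings, a pd.to_datetime conversion on the string side would be needed before the comparison.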
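This is a lot of code, but it is a nearly comprehensive comparison of two DataFrames, given the join keys and the columns to ignore. Its current weakness is that it does not compare or analyze values that may not exist in each dataset.
Also note that this script writes out .csv files of the rows that differ for the specified join keys, containing only the column values from the two datasets. (Comment out that section if you do not want these files written.)
If you prefer the look of a Jupyter notebook, here is a git link: https://github.com/marckeelingiv/MyPyNotebooks/blob/master/Test-Prod%20Compare.ipynb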
# Imports
import pandas as pd
# Set Target Data Sets
test_csv_location = 'test.csv'
prod_csv_location = 'prod.csv'
# Set what columns to join on and what columns to remove
join_columns = ['ORIGINAL_IID','CLAIM_IID','CLAIM_LINE','EDIT_MNEMONIC']
columns_to_remove = ['Original Clean']
# Peek at the data to get a list of the column names
test_df = pd.read_csv(test_csv_location,nrows=10)
prod_df = pd.read_csv(prod_csv_location,nrows=10)
# Create a dictionary to set all columns to strings
all_columns = set()
for c in test_df.columns.values:
    all_columns.add(c)
for c in prod_df.columns.values:
    all_columns.add(c)
dtypes = {}
for c in all_columns:
    dtypes[f'{c}'] = str
# Perform full import setting data types and specifying index
test_df = pd.read_csv(test_csv_location,dtype=dtypes,index_col=join_columns)
prod_df = pd.read_csv(prod_csv_location,dtype=dtypes,index_col=join_columns)
# Drop desired columns (a column missing from a frame is skipped)
for c in columns_to_remove:
    try:
        del test_df[f'{c}']
    except KeyError:
        pass
    try:
        del prod_df[f'{c}']
    except KeyError:
        pass
# Join Data Frames to prepare for comparing
compare_df = test_df.join(
    prod_df,
    how='outer',
    lsuffix='_test', rsuffix='_prod'
).fillna('')
# Create list of columns to compare
columns_to_compare = []
for c in all_columns:
    if c not in columns_to_remove and c not in join_columns:
        columns_to_compare.append(c)
# Show the difference in columns for each data set
list_of_different_columns = []
for column in columns_to_compare:
    are_different = ~(compare_df[f'{column}_test'] == compare_df[f'{column}_prod'])
    differences = are_different.sum()
    test_not_nulls = ~(compare_df[f'{column}_test'] == '')
    prod_not_nulls = ~(compare_df[f'{column}_prod'] == '')
    # Keep rows that differ and are populated on both sides
    temp_df = compare_df[are_different & test_not_nulls & prod_not_nulls]
    if len(temp_df) > 0:
        print(f'{differences} differences in {column}')
        # Count the empty values (the inverse of the not-null masks)
        print(f'\t{(~test_not_nulls).sum()} Nulls in Test')
        print(f'\t{(~prod_not_nulls).sum()} Nulls in Prod')
        to_file = temp_df[[f'{column}_test', f'{column}_prod']].copy()
        to_file.to_csv(path_or_buf=f'{column}_Test.csv')
        list_of_different_columns.append(column)
        del to_file
    del temp_df, prod_not_nulls, test_not_nulls, differences, are_different
# Functions to show/analyze differences
def return_detla_df(column):
    mask = ~(compare_df[f'{column}_test'] == compare_df[f'{column}_prod'])
    mask2 = ~(compare_df[f'{column}_test'] == '')
    mask3 = ~(compare_df[f'{column}_prod'] == '')
    df = compare_df[mask & mask2 & mask3][[f'{column}_test', f'{column}_prod']].copy()
    try:
        # Numeric columns additionally get a prod-minus-test delta
        df['Delta'] = df[f'{column}_prod'].astype(float) - df[f'{column}_test'].astype(float)
        df.sort_values(by='Delta', ascending=False, inplace=True)
    except ValueError:
        # Non-numeric values: return the frame without a Delta column
        pass
    return df
def show_count_of_differnces(column):
    df = return_detla_df(column)
    return pd.DataFrame(
        df.groupby(by=[f'{column}_test', f'{column}_prod']).size(),
        columns=['Count']
    ).sort_values('Count', ascending=False).copy()
# ### Code to run to see differences
# Copy the resulting code into individual Jupyter notebook cells to dig into the differences
for c in list_of_different_columns:
    print(f"## {c}")
    print(f"return_detla_df('{c}')")
    print(f"show_count_of_differnces('{c}')")
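The weakness mentioned above, rows whose join key exists in only one dataset, could be covered with the indicator option of DataFrame.merge. This is a minimal sketch, not part of the original script: it joins on a single key, CLAIM_IID, and the AMOUNT column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the test and prod CSV contents
test_df = pd.DataFrame({"CLAIM_IID": ["1", "2", "3"], "AMOUNT": ["10", "20", "30"]})
prod_df = pd.DataFrame({"CLAIM_IID": ["2", "3", "4"], "AMOUNT": ["20", "35", "40"]})

# indicator=True adds a "_merge" column: left_only, right_only, or both
merged = test_df.merge(
    prod_df,
    on="CLAIM_IID",
    how="outer",
    suffixes=("_test", "_prod"),
    indicator=True,
)

# Rows whose key appears in only one of the two datasets
only_test = merged[merged["_merge"] == "left_only"]
only_prod = merged[merged["_merge"] == "right_only"]

print(only_test[["CLAIM_IID", "AMOUNT_test"]])
print(only_prod[["CLAIM_IID", "AMOUNT_prod"]])
```

Rows marked "both" are the ones the column-by-column comparison in the script already handles.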