[英]Python: check if two dataframes contain filled cells in the same location
So basically what I would like to do is to make sure that cell (x, y)
from either DF1 or DF2 is filled but not in both, for all cells in these dataframes.所以基本上我想做的是确保 DF1 或 DF2 的单元格
(x, y)
被填充,但不是同时填充,对于这些数据帧中的所有单元格。 DF1 and DF2 are of equal shape so there is an equal amount of cells. DF1 和 DF2 的形状相同,因此细胞数量相同。 If both cells in the same location in DF1 and DF2 are filled then it should raise an exception that something goes wrong.
如果 DF1 和 DF2 中同一位置的两个单元格都已填充,则应引发异常,即出现问题。
For some reason, I can't seem to be able to wrap my head around it, although it sounds quite easy.出于某种原因,我似乎无法理解它,尽管这听起来很容易。
What I've tried:我试过的:
.notnull()
and then compare both of them > results in a big boolean mess that is not distinguishable..notnull()
然后比较它们 > 导致无法区分的大布尔混乱。 See below examples of DF1/DF2.请参见下面的 DF1/DF2 示例。 The indices and columns are identical, only different parts are filled, the empty cells are filled with
np.nan
.索引和列是相同的,只有不同的部分被填充,空单元格用
np.nan
填充。 The cell values contain the number of orders on a certain day for a certain delivery day.单元格值包含特定交货日特定日期的订单数量。 The goal is to condense this to a matrix containing the x-week average from a certain order day (mon-sun) for a certain delivery day (mon - sun).
目标是将其浓缩为一个矩阵,该矩阵包含某个订单日(周一-周日)到某个交货日(周一-周日)的 x 周平均值。
order_day,2022-06-18,2022-06-19,2022-06-20,2022-06-21,2022-06-22,2022-06-23,2022-06-24,2022-06-25,2022-06-26,2022-06-27,2022-06-28,2022-06-29,2022-06-30,2022-07-01,2022-07-02,2022-07-03,2022-07-04,2022-07-05,2022-07-06,2022-07-07,2022-07-08
Friday,34.0,,214.0,74.0,46.0,21.0,19.0,,,,,,,,,,,,,,
Saturday,,,79.0,154.0,75.0,28.0,16.0,14.0,,,,,,,,,,,,,
Sunday,,,,301.0,183.0,60.0,42.0,25.0,,,,,,,,,,,,,
Monday,,,,49.0,61.0,216.0,104.0,36.0,,28.0,,,,,,,,,,,
Tuesday,,,,,47.0,180.0,77.0,36.0,,17.0,8.0,,,,,,,,,,
Wednesday,,,,,,84.0,200.0,69.0,,58.0,24.0,10.0,,,,,,,,,
Thursday,,,,,,,84.0,148.0,,87.0,37.0,10.0,3.0,,,,,,,,
order_day,2022-06-18,2022-06-19,2022-06-20,2022-06-21,2022-06-22,2022-06-23,2022-06-24,2022-06-25,2022-06-26,2022-06-27,2022-06-28,2022-06-29,2022-06-30,2022-07-01,2022-07-02,2022-07-03,2022-07-04,2022-07-05,2022-07-06,2022-07-07,2022-07-08
Friday,,,,,,,,44.0,,290.0,86.0,54.0,13.0,16.0,,,,,,,
Saturday,,,,,,,,,,135.0,177.0,125.0,24.0,28.0,8.0,,,,,,
Sunday,,,,,,,,,,,358.0,181.0,58.0,48.0,29.0,,,,,,
Monday,,,,,,,,,,,101.0,156.0,96.0,60.0,32.0,,15.0,,,,
Tuesday,,,,,,,,,,,,3.0,38.0,20.0,6.0,,4.0,2.0,,,
Wednesday,,,,,,,,,,,,,,,,,,,,,
Thursday,,,,,,,,,,,,,,,,,,,,,
Load with pd.read_csv('DF2.csv', index_col='order_day')
使用
pd.read_csv('DF2.csv', index_col='order_day')
加载
There is not really an exact expected output.实际上并没有确切的预期输出。 It could be something like
print('No filled cells overlap!')
.它可能类似于
print('No filled cells overlap!')
。 For this MRE you can be fairly sure that there is no overlap.对于这个 MRE,您可以相当确定没有重叠。 However, I am going to work with larger date ranges and I don't want to rely on good faith.
但是,我将使用更大的日期范围,我不想依赖善意。
Update更新
A most useful output to analyze:要分析的最有用的输出:
dups = (pd.concat([df1.set_index('order_day').stack(),
df2.set_index('order_day').stack()],
keys=['df1', 'df2'], axis=1)
.loc[lambda x: x.notna().all(axis=1)])
print(dups)
# Output:
df1 df2
order_day
Fri 2022-06-20 1.0 2.0
Sat 2022-06-18 1.0 3.0
2022-06-20 3.0 2.0
Tue 2022-06-20 3.0 1.0
Thu 2022-06-19 1.0 3.0
Setup a MRE:设置 MRE:
import pandas as pd
import numpy as np
wdays = ['Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu']
dates = ['2022-06-18', '2022-06-19', '2022-06-20']
np.random.seed(2022)
data1 = np.random.choice([1, 2, 3, np.nan], (7, 3), p=[.2, .1, .2, .5])
np.random.seed(2021)
data2 = np.random.choice([1, 2, 3, np.nan], (7, 3), p=[.1, .2, .2, .5])
df1 = pd.DataFrame(data1, wdays, dates).rename_axis('order_day').reset_index()
df2 = pd.DataFrame(data2, wdays, dates).rename_axis('order_day').reset_index()
print(df1)
print(df2)
# df1
order_day 2022-06-18 2022-06-19 2022-06-20
0 Fri 1.0 3.0 1.0
1 Sat 1.0 NaN 3.0
2 Sun NaN NaN NaN
3 Mon NaN NaN NaN
4 Tue NaN NaN 3.0
5 Wed 3.0 3.0 NaN
6 Thu NaN 1.0 NaN
# df2
order_day 2022-06-18 2022-06-19 2022-06-20
0 Fri NaN NaN 2.0
1 Sat 3.0 NaN 2.0
2 Sun 2.0 NaN NaN
3 Mon NaN 1.0 1.0
4 Tue NaN NaN 1.0
5 Wed NaN NaN NaN
6 Thu NaN 3.0 3.0
Old answer旧答案
Flat your 2 dataframes ( stack
drops NaN values by default) then concatenate them and check duplicate index:扁平化你的 2 个数据帧(
stack
默认丢弃 NaN 值),然后将它们连接起来并检查重复索引:
>>> dups = (pd.concat([df1.set_index('order_day').stack(),
df2.set_index('order_day').stack()])
.loc[lambda x: x.index.duplicated(keep=False)])
Series([], dtype: float64)
This does what I want, but it seems to me that this could be done easier/more pythonic.这是我想要的,但在我看来,这可以更容易/更pythonic。
for col in df1.columns:
for idx in df1.index:
if pd.notna(df1.loc[idx, col]) and pd.notna(df2.loc[idx, col]):
raise Exception(f"Cells ({idx = }, {col = }) both contain values.")
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.