Python: Check if two dataframes contain filled cells in the same location

Question

What I would like to do is make sure that, for every cell position (x, y), the cell is filled in either DF1 or DF2 but not in both. DF1 and DF2 have the same shape, so they contain the same number of cells. If both cells at the same location in DF1 and DF2 are filled, an exception should be raised to signal that something went wrong.

For some reason I can't wrap my head around it, although it sounds quite easy.

What I've tried:

Checking with .notnull() and then comparing the two results: this gives a big boolean mess that is hard to read.
A double for loop would work, but that does not seem Pythonic enough.

See the examples of DF1/DF2 below. The indices and columns are identical; only different parts are filled, and the empty cells contain np.nan. The cell values contain the number of orders placed on a certain day for a certain delivery day. The goal is to condense this into a matrix containing the x-week average from a certain order day (Mon-Sun) for a certain delivery day (Mon-Sun).

EDIT: text files and expected output

DF1.csv

order_day,2022-06-18,2022-06-19,2022-06-20,2022-06-21,2022-06-22,2022-06-23,2022-06-24,2022-06-25,2022-06-26,2022-06-27,2022-06-28,2022-06-29,2022-06-30,2022-07-01,2022-07-02,2022-07-03,2022-07-04,2022-07-05,2022-07-06,2022-07-07,2022-07-08
Friday,34.0,,214.0,74.0,46.0,21.0,19.0,,,,,,,,,,,,,,
Saturday,,,79.0,154.0,75.0,28.0,16.0,14.0,,,,,,,,,,,,,
Sunday,,,,301.0,183.0,60.0,42.0,25.0,,,,,,,,,,,,,
Monday,,,,49.0,61.0,216.0,104.0,36.0,,28.0,,,,,,,,,,,
Tuesday,,,,,47.0,180.0,77.0,36.0,,17.0,8.0,,,,,,,,,,
Wednesday,,,,,,84.0,200.0,69.0,,58.0,24.0,10.0,,,,,,,,,
Thursday,,,,,,,84.0,148.0,,87.0,37.0,10.0,3.0,,,,,,,,

DF2.csv

order_day,2022-06-18,2022-06-19,2022-06-20,2022-06-21,2022-06-22,2022-06-23,2022-06-24,2022-06-25,2022-06-26,2022-06-27,2022-06-28,2022-06-29,2022-06-30,2022-07-01,2022-07-02,2022-07-03,2022-07-04,2022-07-05,2022-07-06,2022-07-07,2022-07-08
Friday,,,,,,,,44.0,,290.0,86.0,54.0,13.0,16.0,,,,,,,
Saturday,,,,,,,,,,135.0,177.0,125.0,24.0,28.0,8.0,,,,,,
Sunday,,,,,,,,,,,358.0,181.0,58.0,48.0,29.0,,,,,,
Monday,,,,,,,,,,,101.0,156.0,96.0,60.0,32.0,,15.0,,,,
Tuesday,,,,,,,,,,,,3.0,38.0,20.0,6.0,,4.0,2.0,,,
Wednesday,,,,,,,,,,,,,,,,,,,,,
Thursday,,,,,,,,,,,,,,,,,,,,,

Load with pd.read_csv('DF2.csv', index_col='order_day')
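As a minimal sketch of that loading step, the snippet below reads a small excerpt of DF1.csv (first three date columns, first two rows) from an in-memory string. `io.StringIO` and the name `df1_excerpt` are stand-ins introduced here for illustration; with the real files you would pass the path instead.

```python
import io
import pandas as pd

# Excerpt of DF1.csv: first three date columns, first two rows only.
csv_text = """order_day,2022-06-18,2022-06-19,2022-06-20
Friday,34.0,,214.0
Saturday,,,79.0
"""

# io.StringIO stands in for the file path used in the question.
df1_excerpt = pd.read_csv(io.StringIO(csv_text), index_col="order_day")

print(df1_excerpt.index.name)           # order_day
print(df1_excerpt.notna().sum().sum())  # 3 filled cells in this excerpt
```

Note that with index_col='order_day' the weekday labels are already the index, so code that calls set_index('order_day') on such a frame would raise a KeyError; in that case the set_index step can simply be dropped.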

Expected output

There is not really an exact expected output. It could be something like print('No filled cells overlap!'). For this MRE you can be fairly sure that there is no overlap. However, I am going to work with larger date ranges and I don't want to rely on good faith.

Answer 1

Update

A more useful output to analyze:

dups = (pd.concat([df1.set_index('order_day').stack(),
                   df2.set_index('order_day').stack()],
                   keys=['df1', 'df2'], axis=1)
          .loc[lambda x: x.notna().all(axis=1)])
print(dups)

# Output:
                      df1  df2
order_day                     
Fri       2022-06-20  1.0  2.0
Sat       2022-06-18  1.0  3.0
          2022-06-20  3.0  2.0
Tue       2022-06-20  3.0  1.0
Thu       2022-06-19  1.0  3.0

Set up an MRE:

import pandas as pd
import numpy as np

wdays = ['Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu']
dates = ['2022-06-18', '2022-06-19', '2022-06-20']
np.random.seed(2022)
data1 = np.random.choice([1, 2, 3, np.nan], (7, 3), p=[.2, .1, .2, .5])
np.random.seed(2021)
data2 = np.random.choice([1, 2, 3, np.nan], (7, 3), p=[.1, .2, .2, .5])
df1 = pd.DataFrame(data1, wdays, dates).rename_axis('order_day').reset_index()
df2 = pd.DataFrame(data2, wdays, dates).rename_axis('order_day').reset_index()
print(df1)
print(df2)

# df1
  order_day  2022-06-18  2022-06-19  2022-06-20
0       Fri         1.0         3.0         1.0
1       Sat         1.0         NaN         3.0
2       Sun         NaN         NaN         NaN
3       Mon         NaN         NaN         NaN
4       Tue         NaN         NaN         3.0
5       Wed         3.0         3.0         NaN
6       Thu         NaN         1.0         NaN

# df2
  order_day  2022-06-18  2022-06-19  2022-06-20
0       Fri         NaN         NaN         2.0
1       Sat         3.0         NaN         2.0
2       Sun         2.0         NaN         NaN
3       Mon         NaN         1.0         1.0
4       Tue         NaN         NaN         1.0
5       Wed         NaN         NaN         NaN
6       Thu         NaN         3.0         3.0

Old answer

Flatten your two dataframes (stack drops NaN values by default), then concatenate them and check for duplicated index entries:

>>> dups = (pd.concat([df1.set_index('order_day').stack(),
...                    df2.set_index('order_day').stack()])
...           .loc[lambda x: x.index.duplicated(keep=False)])
>>> dups
Series([], dtype: float64)
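To see why the duplicated-index check works, here is a small sketch on two hypothetical toy frames (`a` and `b`, not the data from the question) in which exactly one cell is filled in both. After flattening, each Series only has entries for filled cells, so a (row, column) label that appears twice after concatenation marks an overlap. The explicit `.dropna()` is an addition here: newer pandas versions are moving to a stack implementation that keeps NaN values, so dropping them explicitly should make the check independent of that default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; cell ("Sat", "d2") is filled in both.
a = pd.DataFrame({"d1": [1.0, np.nan], "d2": [np.nan, 3.0]}, index=["Fri", "Sat"])
b = pd.DataFrame({"d1": [np.nan, 2.0], "d2": [np.nan, 4.0]}, index=["Fri", "Sat"])

# Flatten to Series indexed by (row, column) and keep filled cells only.
flat_a = a.stack().dropna()
flat_b = b.stack().dropna()

# A label present in both Series shows up twice after concatenation.
both = pd.concat([flat_a, flat_b])
overlap = both[both.index.duplicated(keep=False)]
print(overlap)
```

An empty `overlap` means no cell is filled in both frames; here it holds the two values found at ("Sat", "d2").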

Answer 2

This does what I want, but it seems to me that it could be done more easily or in a more Pythonic way.

for col in df1.columns:
    for idx in df1.index:
        if pd.notna(df1.loc[idx, col]) and pd.notna(df2.loc[idx, col]):
            raise Exception(f"Cells ({idx = }, {col = }) both contain values.")
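One possible vectorized equivalent of the double loop is sketched below. It combines the two notna() masks with `&` and raises if any position is True in both. The helper name `assert_no_overlap`, the toy frames, and the choice of ValueError are assumptions made for illustration; the sketch also assumes both frames share identical index and column labels, as stated in the question.

```python
import numpy as np
import pandas as pd

def assert_no_overlap(df1: pd.DataFrame, df2: pd.DataFrame) -> None:
    """Raise ValueError if any cell is filled in both frames (hypothetical helper)."""
    # True only where both frames hold a non-NaN value at the same labels.
    both_filled = df1.notna() & df2.notna()
    mask = both_filled.to_numpy()
    if mask.any():
        rows, cols = np.nonzero(mask)
        cells = [(both_filled.index[r], both_filled.columns[c])
                 for r, c in zip(rows, cols)]
        raise ValueError(f"Cells filled in both frames: {cells}")
    print("No filled cells overlap!")

# Toy data: `left` and `right` do not overlap, `left` and `clash` do.
left = pd.DataFrame({"d1": [1.0, np.nan]}, index=["Fri", "Sat"])
right = pd.DataFrame({"d1": [np.nan, 2.0]}, index=["Fri", "Sat"])
clash = pd.DataFrame({"d1": [5.0, np.nan]}, index=["Fri", "Sat"])

assert_no_overlap(left, right)
try:
    assert_no_overlap(left, clash)
except ValueError as err:
    message = str(err)
    print(message)
```

The error message lists every offending (index, column) pair at once, whereas the loop stops at the first one it finds.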

2 answers

Answer 1
1, accepted, 2022-06-30 13:12:55

Answer 2
0, 2022-06-30 12:24:24
