How do I check for conflict between columns in a pandas DataFrame?
I'm working on a DataFrame which contains multiple possible values from three different sources for a single item, which is in the index, such as:
import pandas as pd
import numpy as np
inp = [
{"Item": "Item1", "Local A": np.nan, "Local B": 6, "Local C": 5},
{"Item": "Item2", "Local A": 6, "Local B": 7, "Local C": 5},
{"Item": "Item3", "Local A": np.nan, "Local B": np.nan, "Local C": 5},
{"Item": "Item4", "Local A": 5, "Local B": 5, "Local C": 5},
{"Item": "Item5", "Local A": 5, "Local B": np.nan, "Local C": 5},
]
df = pd.DataFrame(inp)
print(df)
Output:
    Item  Local A  Local B  Local C
0  Item1      NaN      6.0        5
1  Item2      6.0      7.0        5
2  Item3      NaN      NaN        5
3  Item4      5.0      5.0        5
4  Item5      5.0      NaN        5
My goal is to create a column which specifies whether there is a conflict between sources when there are multiple non-null values for an index (some cells are empty).
Ideal output:
    Item  Local A  Local B  Local C Conflict
0  Item1      NaN      6.0        5      yes
1  Item2      6.0      7.0        5      yes
2  Item3      NaN      NaN        5      NaN
3  Item4      5.0      5.0        5      NaN
4  Item5      5.0      NaN        5      NaN
In order to do that, I decided to build a filter that checks whether the three sources are non-null and whether they are different.
I also built filters for the three other cases, in which only two values are available for an index.
condition1 = (
df["Local A"].notnull() & df["Local B"].notnull() & df["Local C"].notnull()
) & ~(df["Local A"] == df["Local B"] == df["Local C"])
condition2 = (df["Local A"].notnull() & df["Local B"].notnull()) & ~(
df["Local A"] == df["Local B"]
)
condition3 = (df["Local B"].notnull() & df["Local C"].notnull()) & ~(
df["Local B"] == df["Local C"]
)
condition4 = (df["Local A"].notnull() & df["Local C"].notnull()) & ~(
df["Local A"] == df["Local C"]
)
df.loc[condition1 | condition2 | condition3 | condition4, "Conflict"] = "yes"
This solution of enumerating the different possible outcomes is not very elegant, but I wasn't able to find a simpler alternative. Moreover, I get the following error while running the script:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I've seen this error a few times before and was able to find the cause, but I just can't figure this one out. It seems that I'm comparing Boolean Series instead of individual cases like I want to.
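A likely cause of the error is the chained comparison in condition1. Python evaluates `a == b == c` as `(a == b) and (b == c)`, and `and` needs a single truth value for the whole Series `a == b`, which pandas refuses to give. The sketch below reproduces this on the same data and shows an element-wise rewrite using `&`. The names `a`, `b`, `c`, `all_equal`, `all_present` and `chained_raised` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Local A": [np.nan, 6, np.nan, 5, 5],
        "Local B": [6, 7, np.nan, 5, np.nan],
        "Local C": [5, 5, 5, 5, 5],
    }
)

a, b, c = df["Local A"], df["Local B"], df["Local C"]

# Python expands `a == b == c` to `(a == b) and (b == c)`.
# `and` calls bool() on the Series `a == b`, which raises ValueError.
try:
    a == b == c
    chained_raised = False
except ValueError:
    chained_raised = True

# Element-wise version: combine the two comparisons with `&`.
all_equal = (a == b) & (b == c)
all_present = a.notnull() & b.notnull() & c.notnull()
condition1 = all_present & ~all_equal

print(chained_raised)       # True
print(condition1.tolist())  # [False, True, False, False, False]
```

With condition1 written this way, the rest of the original script (condition2 to condition4 and the final `df.loc` assignment) should run unchanged, since those conditions only compare two Series at a time.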
IIUC, try:
df['Conflict'] = np.where((df.iloc[:, 1:].nunique(axis=1) != 1),'Yes',np.nan)
Output:
    Item  Local A  Local B  Local C Conflict
0  Item1      NaN      6.0        5      Yes
1  Item2      6.0      7.0        5      Yes
2  Item3      NaN      NaN        5      nan
3  Item4      5.0      5.0        5      nan
4  Item5      5.0      NaN        5      nan
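Note that the lowercase nan in this output suggests the non-conflict rows hold the string 'nan' rather than a real missing value, so `isna()` would probably not detect them. A variant that keeps real NaN and matches the question's lowercase "yes" could look like the sketch below. It selects the source columns by name and tests `nunique(axis=1) > 1`, so a row with no values at all is not flagged. The names `sources` and `n_distinct` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

inp = [
    {"Item": "Item1", "Local A": np.nan, "Local B": 6, "Local C": 5},
    {"Item": "Item2", "Local A": 6, "Local B": 7, "Local C": 5},
    {"Item": "Item3", "Local A": np.nan, "Local B": np.nan, "Local C": 5},
    {"Item": "Item4", "Local A": 5, "Local B": 5, "Local C": 5},
    {"Item": "Item5", "Local A": 5, "Local B": np.nan, "Local C": 5},
]
df = pd.DataFrame(inp)

sources = ["Local A", "Local B", "Local C"]

# nunique skips NaN by default, so this counts distinct non-null values per row.
n_distinct = df[sources].nunique(axis=1)

# Assign "yes" only where sources disagree; other rows stay as real NaN.
df.loc[n_distinct > 1, "Conflict"] = "yes"

print(df)
```

Because the untouched rows are real missing values, `df["Conflict"].isna()` and `df["Conflict"].notna()` can be used directly to filter conflicting items.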