Suppose I have the pandas DataFrame defined below:
     a    b
0  N/A    3
1    1    1
2    2    0
3    2  N/A
4    0    1
5  N/A  N/A
I would like to figure out how many rows with defined values in both columns a and b have values that are not equal. In this example there are two such rows, with indices 2 and 4. Indices 0, 3 and 5 contain an undefined value in at least one of the columns, and the row with index 1 has equal values.
The approach I was thinking about would be to drop all the rows that contain undefined values in either a or b, and then, for example, take the difference between the two columns and count the number of non-zero values.
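Written out, that approach might look like the sketch below (counting the non-zero differences, since those mark the unequal rows; the subtraction assumes both columns are numeric):

```python
import numpy as np
import pandas as pd

# the question's DataFrame
df = pd.DataFrame({'a': [np.nan, 1, 2, 2, 0, np.nan],
                   'b': [3, 1, 0, np.nan, 1, np.nan]})

# drop rows with an undefined value in either column, then
# count the non-zero differences between the two columns
clean = df.dropna(subset=['a', 'b'])
count = (clean['a'] - clean['b']).ne(0).sum()
print(count)  # 2
```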
Use boolean indexing with 2 masks - one requiring both values to be non-null (NaN != NaN evaluates to True, so null rows must be excluded explicitly) and one requiring the values to differ:
df1 = df[df[['a', 'b']].notnull().all(axis=1) & (df['a'] != df['b'])]
print (df1)
     a    b
2  2.0  0.0
4  0.0  1.0
Detail:
print (df[['a', 'b']].notnull().all(axis=1))
0    False
1     True
2     True
3    False
4     True
5    False
dtype: bool
print (df['a'] != df['b'])
0     True
1    False
2     True
3     True
4     True
5     True
dtype: bool
print (df[['a', 'b']].notnull().all(axis=1) & (df['a'] != df['b']))
0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool
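The question ultimately asks for a count rather than the rows themselves; since True sums as 1, calling .sum() on the combined mask gives the count directly. A self-contained sketch with the question's data:

```python
import numpy as np
import pandas as pd

# the question's DataFrame
df = pd.DataFrame({'a': [np.nan, 1, 2, 2, 0, np.nan],
                   'b': [3, 1, 0, np.nan, 1, np.nan]})

# both values present AND not equal; True counts as 1 when summed
mask = df[['a', 'b']].notnull().all(axis=1) & (df['a'] != df['b'])
print(mask.sum())  # 2
```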
General solution working with multiple columns - first check that no value per row is NaN using notnull with all, then compare the DataFrame against its first column with ne and keep rows with at least one True per row via any:
df1 = df[df.notnull().all(axis=1) & df.ne(df.iloc[:, 0], axis=0).any(axis=1)]
print (df1)
a b
2 2.0 0.0
4 0.0 1.0
Details:
print (df.notnull())
       a      b
0  False   True
1   True   True
2   True   True
3   True  False
4   True   True
5  False  False
print (df.notnull().all(axis=1))
0    False
1     True
2     True
3    False
4     True
5    False
dtype: bool
print (df.ne(df.iloc[:, 0], axis=0))
       a      b
0   True   True
1  False  False
2  False   True
3  False   True
4  False   True
5   True   True
print (df.ne(df.iloc[:, 0], axis=0).any(axis=1))
0     True
1    False
2     True
3     True
4     True
5     True
dtype: bool
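To see the multi-column version in action, here is a small sketch (the frame df3 and its values are invented for illustration) applying the same one-liner to three columns:

```python
import numpy as np
import pandas as pd

# hypothetical three-column frame (values invented for illustration)
df3 = pd.DataFrame({'a': [1, 2, np.nan, 4],
                    'b': [1, 0, 3, 4],
                    'c': [1, 2, 3, 5]})

# keep rows with no NaN where at least one column differs from column a
out = df3[df3.notnull().all(axis=1) & df3.ne(df3.iloc[:, 0], axis=0).any(axis=1)]
print(out)
```

Row 1 is kept because b differs from a, row 3 because c differs from a; row 2 is dropped for its NaN and row 0 because all columns agree.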
Another solution:
df = df[(df['a'].notnull()) & (df['b'].notnull()) & (df['a'] != df['b'])]
print (df)
a b
2 2.0 0.0
4 0.0 1.0
I would use pandas.DataFrame.apply like this:
df.dropna().apply(lambda x: x.a != x.b, axis=1)
Just drop all NaN values and then compare the two columns element-wise.
The result is:
1    False
2     True
4     True
dtype: bool
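Since the original question asks for how many such rows there are, the boolean result can simply be summed. A self-contained sketch with the question's data:

```python
import numpy as np
import pandas as pd

# the question's DataFrame
df = pd.DataFrame({'a': [np.nan, 1, 2, 2, 0, np.nan],
                   'b': [3, 1, 0, np.nan, 1, np.nan]})

# dropna removes rows 0, 3 and 5; apply then compares a and b row by row
count = df.dropna().apply(lambda x: x.a != x.b, axis=1).sum()
print(count)  # 2
```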
This is one way, using pd.DataFrame.dropna and pd.DataFrame.query:
count = len(df.dropna().query('a != b')) # 2
res = df.dropna().query('a != b')
print(res)
a b
2 2.0 0.0
4 0.0 1.0
Logical comparison gives you a built-in way to do this, without wasting resources on computing column differences.
Assuming:
>>> import numpy as np
>>> import pandas as pd
>>> d = {'a': [np.nan, 1, 2, 2, 0, np.nan], 'b': [3, 1, 0, np.nan, 1, np.nan]}
>>> df = pd.DataFrame(d)
Easiest way might be:
>>> df.dropna().a != df.dropna().b
1    False
2     True
4     True
dtype: bool
You can obviously extend the same thing to more columns.
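As a sketch of that extension (the column c and all values here are hypothetical), chaining the element-wise comparisons with | flags rows where any adjacent pair of columns disagrees:

```python
import numpy as np
import pandas as pd

# hypothetical three-column frame (values invented for illustration)
d = {'a': [np.nan, 1, 2, 2, 0], 'b': [3, 1, 0, np.nan, 1], 'c': [3, 1, 2, 0, 1]}
df = pd.DataFrame(d)

clean = df.dropna()
# True where the row's values are not all equal
mask = (clean.a != clean.b) | (clean.b != clean.c)
print(mask)
```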