
How to count the number of defined value differences between two columns of DataFrame in pandas?

Suppose I have the pandas DataFrame shown below:

    a     b
0  N/A    3
1   1     1
2   2     0
3   2    N/A
4   0     1
5  N/A   N/A

I would like to figure out how many rows with defined values in both columns a and b have values that are not equal. In this example there are two such rows, with indices 2 and 4. Indices 0, 3 and 5 contain undefined values in at least one of the columns and the row with index 1 has the values equal.

The approach I was thinking about would be to drop all the rows that contain undefined values in either a or b, then take the difference between the two columns and count the number of non-zero values.
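The approach described above can be sketched roughly as follows. This is only an illustration: it assumes N/A is stored as NaN, and the names complete, diff and n_different are made up for the example. Rows that differ are the ones whose difference is non-zero.

```python
import numpy as np
import pandas as pd

# Sample frame from the question; N/A is assumed to be stored as NaN.
df = pd.DataFrame({
    "a": [np.nan, 1, 2, 2, 0, np.nan],
    "b": [3, 1, 0, np.nan, 1, np.nan],
})

# Keep only rows where both a and b are defined.
complete = df.dropna(subset=["a", "b"])

# A non-zero difference means the two values are not equal.
diff = complete["a"] - complete["b"]
n_different = int((diff != 0).sum())
print(n_different)  # 2 (rows 2 and 4)
```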

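Use boolean indexing with two masks. The outputs below use rows 0 to 4 only; a row where both values are NaN (like index 5 in the question) would pass both masks, because NaN != NaN is True: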

df1 = df[(df['a'].isnull() == df['b'].isnull()) & (df['a'] != df['b'])]
print (df1)
     a    b
2  2.0  0.0
4  0.0  1.0

Detail:

print ((df['a'].isnull() == df['b'].isnull()))
0    False
1     True
2     True
3    False
4     True
dtype: bool

print ((df['a'] != df['b']))
0     True
1    False
2     True
3     True
4     True
dtype: bool

print ((df['a'].isnull() == df['b'].isnull()) & (df['a'] != df['b']))
0    False
1    False
2     True
3    False
4     True
dtype: bool
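Since the question asks for a count rather than the rows themselves, the mask can be summed directly. The sketch below uses the six-row frame from the question (an assumption, since the outputs above show only five rows) to check how a row with NaN in both columns behaves; the names same_nullness, not_equal and strict are illustrative.

```python
import numpy as np
import pandas as pd

# Six-row frame from the question, including index 5 where both values are NaN.
df = pd.DataFrame({
    "a": [np.nan, 1, 2, 2, 0, np.nan],
    "b": [3, 1, 0, np.nan, 1, np.nan],
})

same_nullness = df["a"].isnull() == df["b"].isnull()
not_equal = df["a"] != df["b"]  # NaN != NaN evaluates to True

mask = same_nullness & not_equal
print(int(mask.sum()))  # 3, because index 5 passes both masks

# Requiring one of the columns to be defined excludes the both-NaN row.
strict = mask & df["a"].notnull()
print(int(strict.sum()))  # 2
```

On a frame that has no rows with NaN in both columns, the two counts agree.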

General solution working with multiple columns: first check that every value in a row is not NaN with notnull and all, then compare the whole DataFrame against its first column with ne, and keep the rows that have at least one True with any:

df1 = df[df.notnull().all(axis=1) & df.ne(df.iloc[:, 0], axis=0).any(axis=1)]
print (df1)
     a    b
2  2.0  0.0
4  0.0  1.0

Details:

print (df.notnull())
       a      b
0  False   True
1   True   True
2   True   True
3   True  False
4   True   True

print (df.notnull().all(axis=1))
0    False
1     True
2     True
3    False
4     True
dtype: bool

print (df.ne(df.iloc[:, 0], axis=0))
       a      b
0   True   True
1  False  False
2  False   True
3  False   True
4  False   True

print (df.ne(df.iloc[:, 0], axis=0).any(axis=1))
0     True
1    False
2     True
3     True
4     True
dtype: bool
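To see the multi-column version at work, here is a small sketch with a third column. Column c is hypothetical and not part of the original question; the mask names are illustrative as well.

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame; column "c" is not in the original question.
df = pd.DataFrame({
    "a": [np.nan, 1, 2, 2, 0],
    "b": [3, 1, 0, np.nan, 1],
    "c": [3, 1, 2, 2, 0],
})

# True where every column in the row is defined.
all_defined = df.notnull().all(axis=1)

# True where at least one column differs from the first column.
differs_from_first = df.ne(df.iloc[:, 0], axis=0).any(axis=1)

mask = all_defined & differs_from_first
print(df[mask])          # rows 2 and 4
print(int(mask.sum()))   # 2
```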

Another solution:

df = df[(df['a'].notnull()) & (df['b'].notnull()) & (df['a'] != df['b'])]
print (df)
     a    b
2  2.0  0.0
4  0.0  1.0

I would use pandas.DataFrame.apply like this:

df.dropna().apply(lambda x: x.a != x.b, axis=1)

Just drop all rows that contain NaN and then compare the two columns row by row.

The result is

1    False
2    True
4    True
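To turn that boolean result into the count the question asks for, it can be summed. The sketch below also shows a column-wise comparison with Series.ne, which should give the same answer without calling a Python function per row; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [np.nan, 1, 2, 2, 0],
    "b": [3, 1, 0, np.nan, 1],
})

clean = df.dropna()

# Row-by-row comparison, as in the answer above.
row_wise = clean.apply(lambda x: x.a != x.b, axis=1)

# Column-wise comparison of the same two columns.
vectorized = clean["a"].ne(clean["b"])

print(int(row_wise.sum()))    # 2
print(int(vectorized.sum()))  # 2
```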

This is one way using pd.DataFrame.dropna and pd.DataFrame.query.

count = len(df.dropna().query('a != b'))  # 2

res = df.dropna().query('a != b')

print(res)

     a    b
2  2.0  0.0
4  0.0  1.0

With a logical comparison you have a built-in way to do this, without wasting resources on arithmetic over the columns.

Assuming:

>>> import numpy as np
>>> import pandas as pd
>>> d = {'a': [np.nan, 1, 2, 2, 0], 'b': [3, 1, 0, np.nan, 1]}
>>> df = pd.DataFrame(d)

Easiest way might be:

>>> df.dropna().a != df.dropna().b

    1    False
    2     True
    4     True
    dtype: bool

You can obviously extend the same thing to more columns.
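One thing to watch when the frame has more columns than the two being compared: a plain dropna() also discards rows with NaN in unrelated columns. A minimal sketch, assuming a hypothetical third column named extra, restricts the check with the subset argument:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [np.nan, 1, 2, 2, 0],
    "b": [3, 1, 0, np.nan, 1],
    "extra": [np.nan, np.nan, np.nan, 1, 1],  # hypothetical unrelated column
})

# Only require a and b to be defined; NaN in "extra" is ignored.
clean = df.dropna(subset=["a", "b"])
differs = clean["a"] != clean["b"]
print(int(differs.sum()))  # 2

# For comparison, dropna() without subset keeps only row 4 here.
print(len(df.dropna()))  # 1
```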
