How to count the number of defined value differences between two columns of DataFrame in pandas?

Question

Assuming that I have a pandas DataFrame defined below:

    a     b
0  N/A    3
1   1     1
2   2     0
3   2    N/A
4   0     1
5  N/A   N/A

I would like to find out how many rows with defined values in both columns a and b have values that are not equal. In this example there are two such rows, with indices 2 and 4. Rows 0, 3 and 5 contain an undefined value in at least one of the columns, and row 1 has equal values.

The approach I was thinking of is to drop all rows that contain an undefined value in either a or b, then take the difference between the two columns and count the number of non-zero results.
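The approach described in the question can be sketched as follows. This is a minimal illustration, assuming that N/A in the table means NaN and that the frame is called df; the names complete, diff and n_different are made up for the example.

```python
import numpy as np
import pandas as pd

# Data reconstructed from the table in the question; N/A is assumed to be NaN.
df = pd.DataFrame({
    "a": [np.nan, 1, 2, 2, 0, np.nan],
    "b": [3, 1, 0, np.nan, 1, np.nan],
})

# Keep only the rows where both columns are defined.
complete = df.dropna(subset=["a", "b"])

# A non-zero difference means the two values are not equal.
diff = complete["a"] - complete["b"]
n_different = int((diff != 0).sum())
print(n_different)
```

Subtraction only works for numeric columns; comparing with != avoids that restriction.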

Answer 1

Use boolean indexing with two masks:

df1 = df[(df['a'].isnull() == df['b'].isnull()) & (df['a'] != df['b'])]
print (df1)
     a    b
2  2.0  0.0
4  0.0  1.0

Detail:

print ((df['a'].isnull() == df['b'].isnull()))
0    False
1     True
2     True
3    False
4     True
dtype: bool

print ((df['a'] != df['b']))
0     True
1    False
2     True
3     True
4     True
dtype: bool

print ((df['a'].isnull() == df['b'].isnull()) & (df['a'] != df['b']))
0    False
1    False
2     True
3    False
4     True
dtype: bool
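One caveat worth checking: the isnull() == isnull() mask is also True when both values are missing, and NaN != NaN evaluates to True, so a row like index 5 in the question's table would be selected. The sketch below, which assumes the six-row data from the question, compares that mask with a stricter one that requires both columns to be non-null, and counts the matches with sum().

```python
import numpy as np
import pandas as pd

# Six-row data from the question, including the row where both values are NaN.
df = pd.DataFrame({
    "a": [np.nan, 1, 2, 2, 0, np.nan],
    "b": [3, 1, 0, np.nan, 1, np.nan],
})

differs = df["a"] != df["b"]  # NaN != NaN is True

# Mask from the answer: True when both are null OR both are defined.
mask_eq = (df["a"].isnull() == df["b"].isnull()) & differs

# Stricter mask: both values must be defined.
mask_strict = df["a"].notnull() & df["b"].notnull() & differs

print(list(df.index[mask_eq]))      # includes the all-NaN row
print(list(df.index[mask_strict]))
print(int(mask_strict.sum()))       # number of differing, fully defined rows
```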

General solution for multiple columns: first check that every value in a row is non-NaN with notnull and all, then compare every column against the first column with ne, and use any to flag rows where at least one column differs:

df1 = df[df.notnull().all(axis=1) & df.ne(df.iloc[:, 0], axis=0).any(axis=1)]
print (df1)
     a    b
2  2.0  0.0
4  0.0  1.0

Details:

print (df.notnull())
       a      b
0  False   True
1   True   True
2   True   True
3   True  False
4   True   True

print (df.notnull().all(axis=1))
0    False
1     True
2     True
3    False
4     True
dtype: bool

print (df.ne(df.iloc[:, 0], axis=0))
       a      b
0   True   True
1  False  False
2  False   True
3  False   True
4  False   True

print (df.ne(df.iloc[:, 0], axis=0).any(axis=1))
0     True
1    False
2     True
3     True
4     True
dtype: bool
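To see the general form on more than two columns and get a count rather than the rows, here is a small sketch. The third column c is hypothetical and not part of the original question.

```python
import numpy as np
import pandas as pd

# Column "c" is a made-up addition to illustrate the multi-column case.
df = pd.DataFrame({
    "a": [np.nan, 1, 2, 2, 0],
    "b": [3, 1, 0, np.nan, 1],
    "c": [3, 1, 2, 2, 0],
})

# Rows where every column is defined.
all_defined = df.notnull().all(axis=1)

# Rows where at least one column differs from the first column.
any_differs = df.ne(df.iloc[:, 0], axis=0).any(axis=1)

count = int((all_defined & any_differs).sum())
print(count)
```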

Another solution:

df = df[(df['a'].notnull()) & (df['b'].notnull()) & (df['a'] != df['b'])]
print (df)
     a    b
2  2.0  0.0
4  0.0  1.0

Answer 2

I would use pandas.DataFrame.apply like this:

df.dropna().apply(lambda x: x.a != x.b, axis=1)

Just drop all NaN values and then compare the two columns element-wise.

The result is

1    False
2     True
4     True
dtype: bool
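Row-wise apply calls the lambda once per row, which can be slow on large frames. A vectorized comparison should give the same boolean Series, and summing it yields the count the question asks for. This is a sketch assuming the five-row data used in the answers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [np.nan, 1, 2, 2, 0],
    "b": [3, 1, 0, np.nan, 1],
})

# Drop incomplete rows once, then compare the columns element-wise.
complete = df.dropna()
differs = complete["a"].ne(complete["b"])

# True counts as 1, so the sum is the number of differing rows.
count = int(differs.sum())
print(differs)
print(count)
```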

Answer 3

This is one way, using pd.DataFrame.dropna and pd.DataFrame.query.

count = len(df.dropna().query('a != b'))  # 2

res = df.dropna().query('a != b')

print(res)

     a    b
2  2.0  0.0
4  0.0  1.0

Answer 4

With a logical comparison you have a built-in way to do this, without wasting resources on summing columns.

Assuming:

>>> import numpy as np
>>> import pandas as pd
>>> d = {'a': [np.nan, 1, 2, 2, 0], 'b': [3, 1, 0, np.nan, 1]}
>>> df = pd.DataFrame(d)

Easiest way might be:

>>> df.dropna().a != df.dropna().b

    1    False
    2     True
    4     True
    dtype: bool

You can obviously extend the same thing to more columns.
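One possible way to extend this to more columns, offered as an assumption rather than what the answer's author had in mind, is to count the distinct values per row with nunique(axis=1) after dropping incomplete rows. The column c below is hypothetical.

```python
import numpy as np
import pandas as pd

# Column "c" is a made-up third column for illustration.
df = pd.DataFrame({
    "a": [np.nan, 1, 2, 2, 0],
    "b": [3, 1, 0, np.nan, 1],
    "c": [3, 1, 2, 2, 0],
})

complete = df.dropna()

# A row has differing values when it holds more than one distinct value.
differs = complete.nunique(axis=1) > 1
print(differs)
print(int(differs.sum()))
```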

4 answers

solution1
1 ACCEPTED 2018-04-23 13:18:10

solution2
1 2018-04-24 09:36:54

solution3
1 2018-04-24 09:49:13

solution4
0 2018-04-23 13:29:08
