Compare values in two columns of a data frame

I have the following two columns in a pandas data frame:

     256   Z
0     2    2
1     2    3
2     4    4
3     4    9

There are around 1594 rows. '256' and 'Z' are the column headers, and 0, 1, 2, 3 are the row numbers (the first column above). I want to print the row numbers where the value in column '256' is not equal to the value in column 'Z', so the output in the above case would be 1, 3. How can this comparison be made in pandas? I would be very grateful for any help. Thanks.

Create the data frame:

import pandas as pd
df = pd.DataFrame({"256":[2,2,4,4], "Z": [2,3,4,9]})

Output:

    256 Z
0   2   2
1   2   3
2   4   4
3   4   9

After subsetting your data frame, use the index to get the id of rows in the subset:

row_ids = df[df["256"] != df.Z].index

gives

Int64Index([1, 3], dtype='int64')
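
As a minimal sketch of the same idea, the boolean mask can also be applied to df.index directly, so the intermediate subset frame is never built. The names mask and row_ids are only illustrative, and the sample frame is the one created above:

```python
import pandas as pd

df = pd.DataFrame({"256": [2, 2, 4, 4], "Z": [2, 3, 4, 9]})

# Boolean Series: True where the two columns differ
mask = df["256"] != df["Z"]

# Apply the mask to the row labels themselves rather than to the frame
row_ids = df.index[mask]
print(list(row_ids))  # [1, 3]
```

The result holds the same labels as df[mask].index; which form reads better is largely a matter of taste.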

Another way is to use the .loc indexer of pandas.DataFrame, which selects the rows that satisfy the boolean condition; the index of that selection then gives the row ids:

df.loc[(df['256'] != df['Z'])].index

with an output of:

Int64Index([1, 3], dtype='int64')

In the IPython notebook timings below, this was marginally the quickest of the listed implementations, although the three variants that return an index are within a few microseconds of each other:

import pandas as pd
import numpy as np

df = pd.DataFrame({"256":np.random.randint(0,10,1594), "Z": np.random.randint(0,10,1594)})

%timeit df.loc[(df['256'] != df['Z'])].index
%timeit row_ids = df[df["256"] != df.Z].index
%timeit rows = list(df[df['256'] != df.Z].index)
%timeit df[df['256'] != df['Z']].index

with an output of:

1000 loops, best of 3: 352 µs per loop
1000 loops, best of 3: 358 µs per loop
1000 loops, best of 3: 611 µs per loop
1000 loops, best of 3: 355 µs per loop

However, a difference of 5-10 microseconds is not significant. If you work with a very large data set in the future, timing and efficiency may become much more important. For your relatively small data set of 1594 rows, I would go with the solution that looks the most elegant and is the most readable.
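
If efficiency does become a concern on a much larger frame, one option worth timing yourself is to compare the underlying NumPy arrays. This is only a sketch and was not part of the benchmark above, so no speed-up is claimed. It assumes both columns have no missing values, and the names differs, positions and labels are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"256": [2, 2, 4, 4], "Z": [2, 3, 4, 9]})

# Compare the raw NumPy arrays element by element
differs = df["256"].to_numpy() != df["Z"].to_numpy()

# flatnonzero returns the integer positions of the True entries
positions = np.flatnonzero(differs)

# Map positions back to row labels (identical here because the index is 0..n-1)
labels = df.index[positions]
print(positions.tolist())  # [1, 3]
```

Note that positions are integer offsets, not labels; the two coincide only when the frame has the default 0..n-1 index.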

You can try this:

# Assuming your DataFrame is named "frame"
rows = list(frame[frame['256'] != frame.Z].index)

rows will now be a list containing the row numbers for which those two column values are not equal. So with your data:

>>> frame
   256  Z
0    2  2
1    2  3
2    4  4
3    4  9

[4 rows x 2 columns]
>>> rows = list(frame[frame['256'] != frame.Z].index)
>>> print(rows)
[1, 3]
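
To print the row numbers in the "1, 3" form the question asks for, a small variation could look like this. It is a sketch on the sample data, and the variable text is just an illustrative name:

```python
import pandas as pd

frame = pd.DataFrame({"256": [2, 2, 4, 4], "Z": [2, 3, 4, 9]})

# Index.tolist() converts the row labels to plain Python ints
rows = frame.index[frame["256"] != frame["Z"]].tolist()

# Join the labels into a comma-separated string
text = ", ".join(str(r) for r in rows)
print(text)  # 1, 3
```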

Assuming df is your dataframe, this should do it:

df[df['256'] != df['Z']].index

yielding:

Int64Index([1, 3], dtype='int64')
