I have the following two columns in a pandas DataFrame:
256 Z
0 2 2
1 2 3
2 4 4
3 4 9
There are around 1594 rows. '256' and 'Z' are the column headers, whereas 0, 1, 2, 3, ... are the row numbers (the first column above). I want to print the row numbers where the value in column '256' is not equal to the value in column 'Z'. The output in the above case should therefore be 1, 3. How can this comparison be made in pandas? I would be very grateful for help. Thanks.
Create the data frame:
import pandas as pd
df = pd.DataFrame({"256":[2,2,4,4], "Z": [2,3,4,9]})
output:
256 Z
0 2 2
1 2 3
2 4 4
3 4 9
After subsetting your data frame, use the index to get the id of rows in the subset:
row_ids = df[df["256"] != df.Z].index
gives
Int64Index([1, 3], dtype='int64')
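As a related sketch, the same boolean mask can be applied to the index itself rather than to the whole data frame. The names mask and row_ids below are only illustrative, and this is one possible variant, not necessarily better than the approach above:

```python
import pandas as pd

df = pd.DataFrame({"256": [2, 2, 4, 4], "Z": [2, 3, 4, 9]})

# Boolean Series: True where the two columns differ
mask = df["256"] != df["Z"]

# Select the row labels directly with the mask,
# without building the intermediate subset of the frame
row_ids = df.index[mask]
print(row_ids.tolist())
```

With the sample data this prints [1, 3].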
Another way is to use the .loc indexer of pandas.DataFrame, which selects the rows that satisfy the boolean condition; their labels are then available through .index:
df.loc[(df['256'] != df['Z'])].index
with an output of:
Int64Index([1, 3], dtype='int64')
In the timings below, run in an IPython notebook, this happened to be marginally the quickest of the listed implementations:
import pandas as pd
import numpy as np
df = pd.DataFrame({"256":np.random.randint(0,10,1594), "Z": np.random.randint(0,10,1594)})
%timeit df.loc[(df['256'] != df['Z'])].index
%timeit row_ids = df[df["256"] != df.Z].index
%timeit rows = list(df[df['256'] != df.Z].index)
%timeit df[df['256'] != df['Z']].index
with an output of:
1000 loops, best of 3: 352 µs per loop
1000 loops, best of 3: 358 µs per loop
1000 loops, best of 3: 611 µs per loop
1000 loops, best of 3: 355 µs per loop
However, a difference of a few microseconds is not significant and is likely within measurement noise; only the list(...) variant is noticeably slower, because it also converts the index to a Python list. If you have a very large data set in the future, timing and efficiency may become much more important. For your relatively small data set of 1594 rows, I would go with the solution that reads most clearly.
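If you want to repeat this kind of comparison outside a notebook, a rough sketch with the standard-library timeit module could look like the following. The fixed seed, the repeat counts, and the dictionary names are assumptions made for illustration, and the numbers will differ from machine to machine:

```python
import timeit

import numpy as np
import pandas as pd

# Fixed seed (an assumption) so the random data is repeatable
rng = np.random.default_rng(0)
df = pd.DataFrame({"256": rng.integers(0, 10, 1594),
                   "Z": rng.integers(0, 10, 1594)})

# The variants compared above, wrapped as zero-argument callables
candidates = {
    "loc": lambda: df.loc[df["256"] != df["Z"]].index,
    "getitem": lambda: df[df["256"] != df["Z"]].index,
    "list": lambda: list(df[df["256"] != df["Z"]].index),
}

timings = {}
for name, func in candidates.items():
    # Best of 5 repeats of 200 calls each, converted to microseconds per call
    best = min(timeit.repeat(func, number=200, repeat=5)) / 200
    timings[name] = best * 1e6
    print(f"{name}: {timings[name]:.0f} µs per call")
```

Taking the minimum over several repeats reduces the influence of background load, which matters when the variants differ by only a few microseconds.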
You can try this:
# Assuming your DataFrame is named "frame"
rows = list(frame[frame['256'] != frame.Z].index)
rows will now be a list containing the row numbers for which those two column values are not equal. So with your data:
>>> frame
256 Z
0 2 2
1 2 3
2 4 4
3 4 9
[4 rows x 2 columns]
>>> rows = list(frame[frame['256'] != frame.Z].index)
>>> print(rows)
[1, 3]
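Note that .index returns row labels, which coincide with row numbers only for the default 0..n-1 index. As a hedged sketch, assuming a hypothetical frame with string labels, positional row numbers can be obtained with numpy.flatnonzero on the boolean mask:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with string labels instead of the default integer index
frame = pd.DataFrame({"256": [2, 2, 4, 4], "Z": [2, 3, 4, 9]},
                     index=["a", "b", "c", "d"])

# True where the two columns differ
mask = frame["256"] != frame["Z"]

# Row labels of the mismatching rows
labels = frame[mask].index.tolist()

# Zero-based row positions, independent of the index labels
positions = np.flatnonzero(mask.to_numpy()).tolist()

print(labels)
print(positions)
```

Here labels is ['b', 'd'] while positions is [1, 3]; with the default index from the question, both would be [1, 3].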
Assuming df is your dataframe, this should do it:
df[df['256'] != df['Z']].index
yielding:
Int64Index([1, 3], dtype='int64')