
IndexError: tuple index out of range. Accessing column in specific row

I'm getting an IndexError that I cannot fix. What I am trying to do is iterate through rows of data and compare a specific column in one row to the same column in a different row. If they are the same, the rows should go into the badBucket; otherwise they go into the goodBucket.

Here is my code:

XDFDF = pd.DataFrame(XDF)
ct1 = 0
ct2 = 0
goodBucket = []
badBucket = []
duplicate = False
for row in XDFDF.iterrows():
    for row2 in XDFDF.iterrows():
        if ct1 != ct2:
            if row[6] == row2[6]:
                badBucket.append(row2)
                duplicate = True
            else:
                goodBucket.append(row2)
        ct2 += 1
    if duplicate:
        badBucket.append(row)
        duplicate = False
    ct1 += 1

Note: XDFDF is a relatively big pandas DataFrame with 7 columns (0,1,2,3,4,5,6).

My Error is:

Traceback (most recent call last):
  File "/Users/john_crowley/PycharmProjects/Greatness/venv/Recipes.py", line 118, in <module>
    if row[6] == row2[6]:
IndexError: tuple index out of range

Process finished with exit code 1

Note: line 118 is the line where 'if row[6] == row2[6]' is typed.

If anyone has a resolution to the specific issue at hand to resolve the IndexError your help would be greatly appreciated, or any comments on improving code would be appreciated as well. If you have any questions please let me know and I will get back to you as soon as I can.

iterrows() does not return just a row, as you expect, but a tuple of the row index and the row itself. That two-element tuple has no index 6, so you get the exception "tuple index out of range" (pay attention to the word tuple).
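To see this directly, here is a minimal sketch that inspects one item produced by iterrows(). The small DataFrame df is a made-up stand-in for XDFDF (7 columns labelled 0 to 6), not the asker's data.

```python
import pandas as pd

# Hypothetical stand-in for XDFDF: 3 rows, 7 columns labelled 0..6.
df = pd.DataFrame([[0, 1, 2, 3, 4, 5, "a"],
                   [0, 1, 2, 3, 4, 5, "b"],
                   [0, 1, 2, 3, 4, 5, "a"]])

# Each item yielded by iterrows() is a 2-tuple: (index label, row as a Series).
item = next(df.iterrows())
print(type(item), len(item))

# Unpacking the tuple gives access to the actual row.
idx, row = item
print(idx, row[6])  # column 6 is read from the Series, not from the tuple
```

Indexing item[6] here would raise the same IndexError as in the question, because item only has positions 0 and 1.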

If you don't need the row index, you can bind it to any name; the conventional choice is _, which is a valid variable name and is used in Python to mark values that you don't need. So the correct loop code is

for _, row in XDFDF.iterrows():
    for _, row2 in XDFDF.iterrows():

Alternatively, if the index is just a sequence of integers starting at 0, you can use it in place of ct1 and ct2, assuming ct2 was meant to be reset to 0 at the start of each inner loop over row2 (by the way, there is no ct2 = 0 before that loop, which may be a logic error). To make sure the index really has that form, I would recommend calling reset_index(drop=True) before the loop. Otherwise the problem would be hard to track down if earlier manipulation of the data had broken the index sequence.
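A rough sketch of that idea follows. It assumes the intent is to classify each row exactly once (bad if any other row shares its value in column 6), which is a guess and differs from what the original loop does. The sample data and the names good_bucket and bad_bucket are made up for illustration.

```python
import pandas as pd

# Hypothetical data; only column 6 matters for the comparison.
df = pd.DataFrame({0: [10, 20, 30, 40], 6: ["a", "b", "a", "c"]})

# Force a clean 0..n-1 index so labels can safely replace the counters.
df = df.reset_index(drop=True)

good_bucket, bad_bucket = [], []
for i, row in df.iterrows():
    # A row is "bad" if any *other* row (i != j) has the same value in column 6.
    has_twin = any(i != j and row[6] == row2[6] for j, row2 in df.iterrows())
    (bad_bucket if has_twin else good_bucket).append(i)

print(bad_bucket, good_bucket)
```

The nested loop is still quadratic in the number of rows, so for a big DataFrame a vectorised approach is likely preferable.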

But in fact, if you only need to find duplicate values (your code does not do exactly that, but I'm not sure whether that is intended or a logic error), you can use pandas' drop_duplicates, which does all the work for you. We can create a column "unique" and set it to True for the indexes that are not dropped as duplicates:

XDFDF["unique"] = False
indexes_of_unique = XDFDF.loc[:, 6].drop_duplicates(keep=False).index
XDFDF.loc[indexes_of_unique, "unique"] = True

The most important part is XDFDF.loc[:, 6].drop_duplicates(keep=False).index. It takes column 6 and drops duplicate values (by default one occurrence of each duplicated value is kept, but keep=False drops every value that has duplicates). That leaves the indexes of the unique values, which we can then mark. An important note: index labels in pandas are not guaranteed to be unique, so I'd recommend running XDFDF.reset_index(drop=True, inplace=True) first to avoid logical collisions between duplicate index labels.
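To illustrate the collision risk, here is a small sketch with a deliberately non-unique index. The helper mark_unique and the sample data are hypothetical; the helper just wraps the three marking steps so they can be run on both versions of the frame.

```python
import pandas as pd

def mark_unique(df):
    """Apply the drop_duplicates(keep=False) marking to a copy of df."""
    df = df.copy()
    df["unique"] = False
    idx = df.loc[:, 6].drop_duplicates(keep=False).index
    df.loc[idx, "unique"] = True
    return df

# Made-up frame whose index label 1 appears twice.
raw = pd.DataFrame({6: ["a", "a", "b"]}, index=[0, 1, 1])

collided = mark_unique(raw)                      # label 1 matches two rows
clean = mark_unique(raw.reset_index(drop=True))  # labels are now 0, 1, 2

print(collided["unique"].tolist())  # [False, True, True]: second "a" wrongly marked
print(clean["unique"].tolist())     # [False, False, True]
```

With the duplicated label, the row holding the second "a" is marked as unique only because it shares its label with "b"; after reset_index(drop=True) only the "b" row is marked.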

There's no need to code the logic for identifying duplicates yourself; use DataFrame.duplicated (predicated on column 6 with keep=False from what I gather you're trying to do) instead.
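A minimal sketch of that suggestion, assuming the goal is to split rows by whether their column 6 value occurs more than once. The sample data and the names bad_bucket and good_bucket are illustrative only.

```python
import pandas as pd

# Hypothetical data; column 6 is the one being checked for duplicates.
df = pd.DataFrame({0: [10, 20, 30, 40], 6: ["a", "b", "a", "c"]})

# keep=False flags every row whose column 6 value appears more than once.
is_dup = df.duplicated(subset=[6], keep=False)

bad_bucket = df[is_dup]    # rows sharing their column 6 value with another row
good_bucket = df[~is_dup]  # rows whose column 6 value is unique

print(bad_bucket.index.tolist(), good_bucket.index.tolist())
```

Because the mask is positional rather than label-based, this split does not depend on the index labels being unique.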
