
pandas remove rows from dataframe based on multiple conditions without for loops

I have a 6-column pandas data frame that I want to process, removing some rows based on certain conditions. The data frame is tab-separated and looks like this:

RO52_HUMAN  TRIM6_HUMAN 1.83e-136   471 45.86   216
RO52_HUMAN  TRI68_HUMAN 6.46e-127   482 42.946  207
RO52_HUMAN  TRI22_HUMAN 6.49e-121   491 41.344  203
RO52_HUMAN  TRI38_HUMAN 7.15e-117   458 42.358  194
RO52_HUMAN  TRIM5_HUMAN 3.6e-114    499 40.281  201
RO52_HUMAN  TRI39_HUMAN 2.56e-111   490 39.388  193
RO52_HUMAN  TRI11_HUMAN 2.35e-109   471 43.524  205
RO52_HUMAN  TRI27_HUMAN 1.44e-108   495 37.576  186
RO52_HUMAN  TRI34_HUMAN 6.12e-105   500 43.0    215
RO52_HUMAN  TRI17_HUMAN 1.79e-87    461 37.093  171

The criteria for removing rows depend on the first two columns only. I also have a dictionary whose keys are protein IDs like those in the first two columns and whose values are lists of other protein IDs. Basically, I want to remove a row if:

the value in the first column is a key in the dictionary and the value in the second column is in the list stored under that key. I wrote the reverse logic (keeping the rows that do not satisfy these conditions) and tried to execute it. What I wrote is this:

blast_out_filtered_df = blast_out_df[ -blast_out_df[0].isin(homolog_dict.keys()) | (blast_out_df[0].isin(homolog_dict.keys() & -blast_out_df[1].isin(homolog_dict[blast_out_df[0]]) ) ) ]

The data frame that I read from my file is called blast_out_df, and the new data frame that I'm trying to create with the filtered rows is blast_out_filtered_df. Of course, running this code gives me the following error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\mstambou\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\generic.py", line 806, in __hash__
' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed

This is because I'm trying to index the dictionary with the value of a column at a particular row, but homolog_dict[blast_out_df[0]] passes the whole column (a Series) as the key. How can I do this operation efficiently? I implemented it using the .iterrows() method, but I have over a million rows and this is just too slow. Any suggestions? Thank you.

The dictionary looks like this:

homolog_dict['MAPK5_MOUSE']
['MAPK5_HUMAN']

In this case the key is 'MAPK5_MOUSE' and the value is ['MAPK5_HUMAN'], a list of one element.
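
To make the cause of the error concrete, here is a minimal sketch on a toy frame. The two rows and the dictionary contents are made up for illustration. Using a whole column as a dictionary key asks Python to hash a Series, which raises the TypeError; Series.map looks up one value at a time instead.

```python
import pandas as pd

# Toy data; these rows and dictionary entries are illustrative assumptions.
blast_out_df = pd.DataFrame({0: ["RO52_HUMAN", "MAPK5_MOUSE"],
                             1: ["TRI68_HUMAN", "MAPK5_HUMAN"]})
homolog_dict = {"MAPK5_MOUSE": ["MAPK5_HUMAN"]}

# Using an entire column as a dictionary key fails: a Series is unhashable.
try:
    homolog_dict[blast_out_df[0]]
    lookup_error = None
except TypeError as exc:
    lookup_error = exc

# Series.map looks up each value separately; keys that are missing become NaN.
per_row_homologs = blast_out_df[0].map(homolog_dict)
```

Here per_row_homologs holds the homolog list for each row whose first column is a key, and NaN for the others, so any row-wise check still has to handle the NaN case.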

I was able to find a solution by doing this:

dct_2 = dict(RO52_HUMAN=['TRI68_HUMAN', 'TRI67_HUMAN'])

blast_out_df[list(map(isnt_in, zip(blast_out_df[1], blast_out_df[0].map(dct_2))))]

and by defining my own function:

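def isnt_in(lst_item):
    # lst_item is a (second-column value, homolog list or NaN) pair
    if str(lst_item[1]) == 'nan':  # first-column value is not a key: keep the row
        return True
    return lst_item[0] not in lst_item[1]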

The map function on its own won't cut it, since the values in my dictionary are lists. I also had to define my own function because map returns np.nan when a key is not found in the dictionary; the function returns True in those cases, which is what this task needs.
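
A possible alternative, offered only as a sketch and not as part of the solution above: dict.get with an empty default sidesteps the NaN check, because a missing key then yields an empty tuple that contains nothing. The three sample rows are illustrative; dct_2 is the dictionary used above.

```python
import pandas as pd

# A few rows in the same shape as the question's data (first two columns only).
blast_out_df = pd.DataFrame({
    0: ["RO52_HUMAN", "RO52_HUMAN", "MAPK5_MOUSE"],
    1: ["TRIM6_HUMAN", "TRI68_HUMAN", "MAPK5_HUMAN"],
})
# Same dictionary as above; a first-column value absent from it means "keep the row".
dct_2 = dict(RO52_HUMAN=["TRI68_HUMAN", "TRI67_HUMAN"])

# Keep a row unless its second value is listed under its first value.
keep = [second not in dct_2.get(first, ())
        for first, second in zip(blast_out_df[0], blast_out_df[1])]
blast_out_filtered_df = blast_out_df[keep]
```

This is still a Python-level loop over the rows, but it avoids the per-row overhead of .iterrows(). If the lists are long, converting each one to a set beforehand should make the membership test cheaper.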
