I have a 6-column pandas DataFrame that I want to process, removing some rows based on certain conditions. The data frame is tab-separated and looks like this:
RO52_HUMAN TRIM6_HUMAN 1.83e-136 471 45.86 216
RO52_HUMAN TRI68_HUMAN 6.46e-127 482 42.946 207
RO52_HUMAN TRI22_HUMAN 6.49e-121 491 41.344 203
RO52_HUMAN TRI38_HUMAN 7.15e-117 458 42.358 194
RO52_HUMAN TRIM5_HUMAN 3.6e-114 499 40.281 201
RO52_HUMAN TRI39_HUMAN 2.56e-111 490 39.388 193
RO52_HUMAN TRI11_HUMAN 2.35e-109 471 43.524 205
RO52_HUMAN TRI27_HUMAN 1.44e-108 495 37.576 186
RO52_HUMAN TRI34_HUMAN 6.12e-105 500 43.0 215
RO52_HUMAN TRI17_HUMAN 1.79e-87 461 37.093 171
The criteria for removing rows depend on the first two columns only. I also have a dictionary whose keys are protein IDs like those in the first two columns, and whose values are lists of other protein IDs. Basically, I want to remove a row if:
the value of the first column is in the dictionary as a key, and the value of the second column is in that key's list of values. I wrote the reverse logic for this (i.e. keep the rows that do not satisfy these conditions) and tried to execute it like so:
blast_out_filtered_df = blast_out_df[ -blast_out_df[0].isin(homolog_dict.keys()) | (blast_out_df[0].isin(homolog_dict.keys() & -blast_out_df[1].isin(homolog_dict[blast_out_df[0]]) ) ) ]
The data frame that I read in is called blast_out_df, and the new data frame that I'm trying to create with the filtered rows is blast_out_filtered_df. Of course, running this code gives me the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\mstambou\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\generic.py", line 806, in __hash__
' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed
This is because I'm trying to index the dictionary with the value of a column at a particular row. How can I do this operation efficiently? I implemented it using the .iterrows() method, but I have over a million rows and that is just too slow. Any suggestions? Thank you.
The dictionary looks like this:
homolog_dict['MAPK5_MOUSE']
['MAPK5_HUMAN']
In this case the key is 'MAPK5_MOUSE' and the value is ['MAPK5_HUMAN'], a list of one element.
I was able to find a solution by doing this:
dct_2 = dict(RO52_HUMAN=['TRI68_HUMAN', 'TRI67_HUMAN'])
blast_out_df[list(map(isnt_in, zip(blast_out_df[1], blast_out_df[0].map(dct_2))))]
and by defining my own function:
def isnt_in(lst_item):
    # lst_item is a (column-1 value, homolog list) pair; the second item
    # is NaN when the column-0 value was not a key of the dictionary.
    if str(lst_item[1]) == 'nan':
        return True
    return lst_item[0] not in lst_item[1]
The map function on its own won't cut it, since the values of my dictionary are lists. I also had to define my own function because .map() returns np.nan when it can't find a key in the dictionary; the function returns True in those cases, for the purpose of this task.
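For reference, an equivalent approach (a sketch, using made-up example data in place of the real homolog_dict and blast_out_df) is to flatten the dictionary into a set of forbidden (query, hit) pairs once up front, so each row then costs a single O(1) set lookup instead of a dictionary probe per row:

```python
import pandas as pd

# Hypothetical stand-ins for the real homolog_dict and blast_out_df.
homolog_dict = {'RO52_HUMAN': ['TRI68_HUMAN', 'TRI67_HUMAN']}

blast_out_df = pd.DataFrame([
    ['RO52_HUMAN', 'TRI68_HUMAN', 6.46e-127],   # should be removed
    ['RO52_HUMAN', 'TRIM6_HUMAN', 1.83e-136],   # hit not in the key's list
    ['TRI22_HUMAN', 'TRI68_HUMAN', 1.00e-50],   # key not in the dictionary
])

# Flatten the dict into a set of (key, hit) pairs to drop.
forbidden = {(key, hit) for key, hits in homolog_dict.items() for hit in hits}

# Boolean mask: True for rows whose (col 0, col 1) pair is not forbidden.
mask = [pair not in forbidden
        for pair in zip(blast_out_df[0], blast_out_df[1])]
blast_out_filtered_df = blast_out_df[mask]
```

Rows whose first-column value is not a dictionary key are kept automatically, since their pair can never appear in the set, so no NaN special-casing is needed.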