pandas: remove rows from a DataFrame based on multiple conditions without for loops

I have a six-column pandas DataFrame that I want to process, removing some rows based on certain conditions. The data is tab-separated and looks like this:

RO52_HUMAN  TRIM6_HUMAN 1.83e-136   471 45.86   216
RO52_HUMAN  TRI68_HUMAN 6.46e-127   482 42.946  207
RO52_HUMAN  TRI22_HUMAN 6.49e-121   491 41.344  203
RO52_HUMAN  TRI38_HUMAN 7.15e-117   458 42.358  194
RO52_HUMAN  TRIM5_HUMAN 3.6e-114    499 40.281  201
RO52_HUMAN  TRI39_HUMAN 2.56e-111   490 39.388  193
RO52_HUMAN  TRI11_HUMAN 2.35e-109   471 43.524  205
RO52_HUMAN  TRI27_HUMAN 1.44e-108   495 37.576  186
RO52_HUMAN  TRI34_HUMAN 6.12e-105   500 43.0    215
RO52_HUMAN  TRI17_HUMAN 1.79e-87    461 37.093  171
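
How the file was actually loaded is not shown, so the following is only a minimal sketch of one way to get integer column labels like the `blast_out_df[0]` and `blast_out_df[1]` used later. It assumes the file has no header row, and it reads three of the sample rows from an in-memory string instead of a real file.

```python
import io
import pandas as pd

# Three rows copied from the sample above, tab separated, no header row.
raw = (
    "RO52_HUMAN\tTRIM6_HUMAN\t1.83e-136\t471\t45.86\t216\n"
    "RO52_HUMAN\tTRI68_HUMAN\t6.46e-127\t482\t42.946\t207\n"
    "RO52_HUMAN\tTRI22_HUMAN\t6.49e-121\t491\t41.344\t203\n"
)

# header=None makes pandas label the columns 0..5 instead of
# treating the first data row as column names.
blast_out_df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)
print(blast_out_df.shape)
```

With a real file, the `io.StringIO(raw)` argument would be replaced by the file path.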

The criteria for removing rows depend on the first two columns only. I also have a dictionary whose keys are protein IDs like those in the first two columns and whose values are lists of other protein IDs. Basically, I want to remove a row if:

the value in the first column is a key in the dictionary and the value in the second column is in the list stored under that key. I wrote the reverse logic (keeping the rows that do not satisfy these conditions) and tried to execute it like this:

blast_out_filtered_df = blast_out_df[ -blast_out_df[0].isin(homolog_dict.keys()) | (blast_out_df[0].isin(homolog_dict.keys() & -blast_out_df[1].isin(homolog_dict[blast_out_df[0]]) ) ) ]

The data frame that I read my file into is called blast_out_df, and the new data frame that I'm trying to create from the filtered rows is blast_out_filtered_df. Of course, running this code gives me the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\mstambou\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\generic.py", line 806, in __hash__
    ' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed

This is because homolog_dict[blast_out_df[0]] indexes the dictionary with the whole column (a Series), whereas I want to index it with the column's value in each row. How can I do this operation efficiently? I implemented it using the .iterrows() method, but I have over a million rows and that is just too slow. Any suggestions? Thank you.
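
The failure can be reproduced on a small scale. The sketch below uses a hypothetical three-element Series in place of the real first column and a small dictionary in the same shape as homolog_dict; the exact error message may differ between pandas versions. It also shows that `Series.map` does the lookup one element at a time, which is closer to the intended behaviour.

```python
import pandas as pd

# Hypothetical stand-ins for the real dictionary and first column.
homolog_dict = {
    "MAPK5_MOUSE": ["MAPK5_HUMAN"],
    "RO52_HUMAN": ["TRI68_HUMAN", "TRI67_HUMAN"],
}
first_col = pd.Series(["MAPK5_MOUSE", "RO52_HUMAN", "TRIM6_HUMAN"])

# A dict lookup has to hash its key, and a whole Series is not hashable.
try:
    homolog_dict[first_col]
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Series.map looks up each element separately; elements that are
# not keys of the dictionary come back as NaN.
looked_up = first_col.map(homolog_dict)
print(looked_up.tolist())
```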

The dictionary looks like this:

homolog_dict['MAPK5_MOUSE']
['MAPK5_HUMAN']

In this case the key is 'MAPK5_MOUSE' and the value is ['MAPK5_HUMAN'], a list with one element.

I was able to find a solution by doing this:

dct_2 = dict(RO52_HUMAN=['TRI68_HUMAN', 'TRI67_HUMAN'])

# list(...) is needed on Python 3, where map returns an iterator rather than a list
blast_out_df[list(map(isnt_in, zip(blast_out_df[1], blast_out_df[0].map(dct_2))))]

and by defining my own function:

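def isnt_in(lst_item):
    # lst_item is a pair: (second-column value, list mapped from the first column)
    if str(lst_item[1]) == 'nan':
        # the first-column value is not a key in the dictionary, so keep the row
        return True
    return lst_item[0] not in lst_item[1]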

The map method on its own won't cut it, since the values in my dictionary are lists. I also had to define my own function because map returns np.nan when a key is not found in the dictionary; in those cases the function returns True, so the row is kept.
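
As an alternative to the approach above (not the method used in this post), the dictionary can be flattened into (key, value) pairs and the first two columns tested against those pairs in one call. This is only a sketch: the small data frame and dictionary below are hypothetical stand-ins, and it relies on `pd.MultiIndex.from_arrays` and `MultiIndex.isin` from pandas.

```python
import pandas as pd

# Hypothetical stand-ins for the real data (first two columns only).
blast_out_df = pd.DataFrame({
    0: ["RO52_HUMAN", "RO52_HUMAN", "MAPK5_MOUSE", "TRIM6_HUMAN"],
    1: ["TRIM6_HUMAN", "TRI68_HUMAN", "MAPK5_HUMAN", "RO52_HUMAN"],
})
homolog_dict = {
    "RO52_HUMAN": ["TRI68_HUMAN", "TRI67_HUMAN"],
    "MAPK5_MOUSE": ["MAPK5_HUMAN"],
}

# Flatten the dictionary into (key, listed value) pairs: the rows to remove.
pairs = [(key, value) for key, values in homolog_dict.items() for value in values]

# Treat the first two columns as a two-level index and test all rows at once.
to_drop = pd.MultiIndex.from_arrays([blast_out_df[0], blast_out_df[1]]).isin(pairs)

# Keep only the rows whose (column 0, column 1) pair is not in the dictionary.
blast_out_filtered_df = blast_out_df[~to_drop]
print(blast_out_filtered_df)
```

Rows whose first-column value is not a key need no special NaN handling here, because their pair simply never appears in `pairs`.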

