
Iterating through Pandas Data Frame with conditions

I am trying to iterate through a large data frame. However, I can't figure out how to include some conditions. Below is an example of my data frame:

       0        1        2    3
0  chr3R  4174822  4174922    1.0
1  chr3R  4175400  4175500    0.0
2  chr3R  4175466  4175566    0.5
3  chr3R  4175521  4175621    1.0
4  chr3R  4175603  4175703    0.0

I want to iterate through the rows and find each row x where the difference between column 1 of row x and column 1 of the first row is less than 5000. If that difference is less than 5000, I want to collect the values of column 3 for the rows from the first row through row x into a list. I then want to repeat this condition throughout the data frame and build a list of lists of column 3 values.

I tried using iterrows(), but I just go through the entire data frame and get nothing out.

Thanks.

Rodrigo

This can be done without using iterrows. The approaches in the other answers will also work. Another approach is to use np.where from the numpy package. Here is an example; please modify it to fit your requirements.

    import numpy as np 
    df['newcol'] = np.where(df[1]- df[1].iloc[0] <  5000, 1, df[1])
    dfList = df['newcol'].tolist()
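
As a rough, self-contained sketch of how np.where could be adapted to pull out the column 3 values directly: the 0/1 flag and the extra far-away row are assumptions added here for illustration, not part of the snippet above.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, plus one row whose column 1 is far away.
df = pd.DataFrame([
    ["chr3R", 4174822, 4174922, 1.0],
    ["chr3R", 4175400, 4175500, 0.0],
    ["chr3R", 4175466, 4175566, 0.5],
    ["chr3R", 4175521, 4175621, 1.0],
    ["chr3R", 4175603, 4175703, 0.0],
    ["chr3R", 5005603, 4175703, 0.0],
])

# 1 where column 1 is less than 5000 above the first row's column 1, else 0.
within = np.where(df[1] - df[1].iloc[0] < 5000, 1, 0)

# Keep column 3 only for the flagged rows and convert to a plain list.
values = df.loc[within == 1, 3].tolist()
print(values)
```

Using a 0/1 flag rather than writing column 1 back into the new column keeps the flag easy to reuse as a row filter.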

Let's take your dataset and add a few more rows to it.

>>> import pandas as pd
>>> data = pd.DataFrame([
...     ['chr3R', 4174822, 4174922, 1.0],
...     ['chr3R', 4175400, 4175500, 0.0],
...     ['chr3R', 4175466, 4175566, 0.5],
...     ['chr3R', 4175521, 4175621, 1.0],
...     ['chr3R', 4175603, 4175703, 0.0],
...     ['chr3R', 5005603, 4175703, 0.0],   # col 1 is more than 5000 away
...     ['chr3R', 6005603, 4175703, 0.0],   # col 1 is more than 5000 away
... ])

The last two rows were added as examples of rows where column 1 is more than 5000 higher than 4174822 (the first value of column 1).

You can select the rows whose column 1 value is less than 5,000 above the first value, 4174822, as follows:

>>> subset = data[data[1] - data[1][0] < 5000]
>>> subset
       0        1        2    3
0  chr3R  4174822  4174922  1.0
1  chr3R  4175400  4175500  0.0
2  chr3R  4175466  4175566  0.5
3  chr3R  4175521  4175621  1.0
4  chr3R  4175603  4175703  0.0

... and then iterate using .iterrows().

>>> for index, row in subset.iterrows():
...     pass  # do something with row

>>> df[(df.iloc[:, 1] - df.iat[1, 1]) < 5000][3].tolist()
[1.0, 0.0, 0.5, 1.0, 0.0]

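df.iloc[:, 1] selects all rows of column 1. df.iat[1, 1] is the single value at row position 1, column position 1; positions are zero-based, so this is the second row (use df.iat[0, 1] to compare against the first row instead). Subtracting that value and comparing with < 5000 gives a boolean mask that keeps only the rows where the difference is less than 5000.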

The [3] at the end then selects the column labeled 3 (the fourth column), which returns a Series. Since you want a list, just append .tolist() to the result.
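
The steps described here can be broken out into a small runnable sketch. It assumes the first row (position 0) is the intended reference and adds one far-away row from the longer example dataset so the mask has something to exclude; both choices are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    ["chr3R", 4174822, 4174922, 1.0],
    ["chr3R", 4175400, 4175500, 0.0],
    ["chr3R", 4175466, 4175566, 0.5],
    ["chr3R", 4175521, 4175621, 1.0],
    ["chr3R", 4175603, 4175703, 0.0],
    ["chr3R", 5005603, 4175703, 0.0],
])

# Reference value: row position 0, column position 1 (assumed to be the intent).
first = df.iat[0, 1]

# Boolean mask: True where column 1 is less than 5000 above the reference.
mask = (df.iloc[:, 1] - first) < 5000

# Filter the rows, pick the column labeled 3 (a Series), then make a list.
result = df[mask][3].tolist()
print(mask.tolist())
print(result)
```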

Great, thanks guys.

However, I need to create a list of lists. I can grab the first rows that have a difference of less than 5000 with the first row. I then need to grab the next rows with a difference of 5000. What is the best way to iterate through this process?

Thanks.
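
One possible reading of this follow-up is a greedy grouping: start a group at the first row, keep adding rows while column 1 stays less than 5000 above the group's first row, then start a new group at the first row that falls outside. The sketch below works under that assumption and also assumes the rows are sorted by column 1 and belong to a single chromosome; the function name group_by_window and its parameters are hypothetical.

```python
import pandas as pd

def group_by_window(df, pos_col=1, val_col=3, window=5000):
    """Collect val_col values into lists, one list per window of pos_col."""
    groups = []
    anchor = None  # pos_col value of the row that opened the current group
    for pos, val in zip(df[pos_col], df[val_col]):
        # Open a new group at the first row, or once we leave the window.
        if anchor is None or pos - anchor >= window:
            anchor = pos
            groups.append([])
        groups[-1].append(val)
    return groups

data = pd.DataFrame([
    ["chr3R", 4174822, 4174922, 1.0],
    ["chr3R", 4175400, 4175500, 0.0],
    ["chr3R", 4175466, 4175566, 0.5],
    ["chr3R", 4175521, 4175621, 1.0],
    ["chr3R", 4175603, 4175703, 0.0],
    ["chr3R", 5005603, 4175703, 0.0],
    ["chr3R", 6005603, 4175703, 0.0],
])

list_of_lists = group_by_window(data)
print(list_of_lists)
```

Iterating over zip of the two columns avoids the per-row overhead of iterrows(), which may matter on a large data frame. If the data spans several chromosomes, the same function could be applied to each group from data.groupby(0).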
