Pandas 如果一个值介于另一个数据帧的两个值之间，则过滤一个 dataframe

Question

I have two data frames as follows:我有两个数据框如下：

df1 df1

  chr_number      start        end strand
0       chr1  111478338  111478339      +
1       chr1  111478370  111478371      +
2       chr1  111478372  111478373      +
3       chr1  157123306  157123307      -
4       chr1  157123307  157123308      -
5       chr1  212619741  212619742      +
6       chr1  212619742  212619743      +

df2 df2

  Chromosome      Start        End  Log2 Fold Change Strand      Gene  \
0       chr1  111478330  111478444          3.036912      +  C1orf162   
1       chr1  157123300  157123338          3.293174      -      ETV3   
2       chr1  207079296  207079412          3.916122      +    PFKFB2   
3       chr1  212619736  212619771          3.880546      +      ATF3   

           Ensembl ID Feature  
0  ENSG00000143110.11  3' UTR  
1  ENSG00000117036.12  3' UTR  
2  ENSG00000123836.15  3' UTR  
3  ENSG00000162772.17  3' UTR

I need to look if start from df1 is located between Start and End in df2.我需要查看从 df1 开始是否位于 df2 中的开始和结束之间。 If so, I'd like to have a new data frame which contains start value from df1 with corresponding row in df2.如果是这样，我想要一个新的数据框，其中包含 df1 的起始值和 df2 中的相应行。

Here is the example of what I need for each start value from df1:以下是我需要 df1 中每个起始值的示例：

   CrossLink Chromosome        Start          End  Log2 Fold Change Strand  \
1  111478338       chr1  111478330.0  111478444.0          3.036912      +   

       Gene          Ensembl ID Feature  
1  C1orf162  ENSG00000143110.11  3' UTR

I wrote this code:我写了这段代码：

df3 = pd.DataFrame([])
df3["CrossLink"] = np.nan
for v in df1["start"]:
    df4 = df2[(df2["Start"] <= v) & (df2["End"] > v)]
    df3 = df3.append(df4)
    df3["CrossLink"] = df1["start"]

And I get this output:我得到这个 output：

   CrossLink Chromosome        Start          End  Log2 Fold Change Strand  \
0  111478338       chr1  111478330.0  111478444.0          3.036912      +   
0  111478338       chr1  111478330.0  111478444.0          3.036912      +   
0  111478338       chr1  111478330.0  111478444.0          3.036912      +   
1  111478370       chr1  157123300.0  157123338.0          3.293174      -   
1  111478370       chr1  157123300.0  157123338.0          3.293174      -   
3  157123306       chr1  212619736.0  212619771.0          3.880546      +   
3  157123306       chr1  212619736.0  212619771.0          3.880546      +   

       Gene          Ensembl ID Feature  
0  C1orf162  ENSG00000143110.11  3' UTR  
0  C1orf162  ENSG00000143110.11  3' UTR  
0  C1orf162  ENSG00000143110.11  3' UTR  
1      ETV3  ENSG00000117036.12  3' UTR  
1      ETV3  ENSG00000117036.12  3' UTR  
3      ATF3  ENSG00000162772.17  3' UTR  
3      ATF3  ENSG00000162772.17  3' UTR

It does not contain all my start values from df1 and it gives me duplicates.它不包含我从 df1 开始的所有值，它给了我重复项。 I am quite new in python and pandas and I searched a lot but I couldn't figure it out.我在 python 和 pandas 很新，我搜索了很多但我无法弄清楚。

Thanks a lot in advance for your help!非常感谢您的帮助！

Answer 1

A solution using a two step process:使用两步过程的解决方案：

Let's say we have假设我们有

df = pd.DataFrame({'chr_number':['chr1', 'chr2'], 'start':[3, 5],})

df2 = pd.DataFrame({'index': ['chr1', 'chr3'], 'col': ['a', 'b'], 'start': [1, 2], 'end':[4, 5]})

print(df)
print(df2)

  chr_number  start
0       chr1      3
1       chr2      5
  index col  start  end
0  chr1   a      1    4
1  chr3   b      2    5

We can then apply aggregation and explode to get the desired output.然后我们可以应用聚合和分解以获得所需的 output。

df2.start = df2.apply(lambda x: df.loc[(x['start'] <= df.start) & (df.start <= x['end'])].start.agg(list), axis=1)
print(df2.explode('start'))

  index col start  end
0  chr1   a     3    4
1  chr3   b     3    5
1  chr3   b     5    5

Edit: I realized that I was doing the incorrect operation comparing df2 values instead of df .编辑：我意识到我在比较df2值而不是df时做的操作不正确。 The edited code now replaces df2.start with all df.start values that fall between df2.start and df2.end for rows of df2 .编辑后的代码现在将df2.start替换为df2行的df2.start和df2.end之间的所有df.start值。

Pandas 如果一个值介于另一个数据帧的两个值之间，则过滤一个 dataframe

问题描述

1 个解决方案

解决方案1
0 2022-02-15 00:02:00

Pandas 如果一个值介于另一个数据帧的两个值之间，则过滤一个 dataframe

问题描述

1 个解决方案

解决方案1 0 2022-02-15 00:02:00

解决方案1
0 2022-02-15 00:02:00