Pandas: filtering one dataframe if a value is between two values from another data frame

I have two data frames as follows:

df1

  chr_number      start        end strand
0       chr1  111478338  111478339      +
1       chr1  111478370  111478371      +
2       chr1  111478372  111478373      +
3       chr1  157123306  157123307      -
4       chr1  157123307  157123308      -
5       chr1  212619741  212619742      +
6       chr1  212619742  212619743      +

df2

  Chromosome      Start        End  Log2 Fold Change Strand      Gene  \
0       chr1  111478330  111478444          3.036912      +  C1orf162   
1       chr1  157123300  157123338          3.293174      -      ETV3   
2       chr1  207079296  207079412          3.916122      +    PFKFB2   
3       chr1  212619736  212619771          3.880546      +      ATF3   

           Ensembl ID Feature  
0  ENSG00000143110.11  3' UTR  
1  ENSG00000117036.12  3' UTR  
2  ENSG00000123836.15  3' UTR  
3  ENSG00000162772.17  3' UTR    

I need to check whether each start value from df1 lies between Start and End in df2. If so, I'd like a new data frame that contains the start value from df1 together with the corresponding row from df2.

Here is an example of what I need for each start value from df1:

   CrossLink Chromosome        Start          End  Log2 Fold Change Strand  \
1  111478338       chr1  111478330.0  111478444.0          3.036912      +   

       Gene          Ensembl ID Feature  
1  C1orf162  ENSG00000143110.11  3' UTR 

I wrote this code:

df3 = pd.DataFrame([])
df3["CrossLink"] = np.nan
for v in df1["start"]:
    df4 = df2[(df2["Start"] <= v) & (df2["End"] > v)]
    df3 = df3.append(df4)
    df3["CrossLink"] = df1["start"]

And I get this output:

   CrossLink Chromosome        Start          End  Log2 Fold Change Strand  \
0  111478338       chr1  111478330.0  111478444.0          3.036912      +   
0  111478338       chr1  111478330.0  111478444.0          3.036912      +   
0  111478338       chr1  111478330.0  111478444.0          3.036912      +   
1  111478370       chr1  157123300.0  157123338.0          3.293174      -   
1  111478370       chr1  157123300.0  157123338.0          3.293174      -   
3  157123306       chr1  212619736.0  212619771.0          3.880546      +   
3  157123306       chr1  212619736.0  212619771.0          3.880546      +   

       Gene          Ensembl ID Feature  
0  C1orf162  ENSG00000143110.11  3' UTR  
0  C1orf162  ENSG00000143110.11  3' UTR  
0  C1orf162  ENSG00000143110.11  3' UTR  
1      ETV3  ENSG00000117036.12  3' UTR  
1      ETV3  ENSG00000117036.12  3' UTR  
3      ATF3  ENSG00000162772.17  3' UTR  
3      ATF3  ENSG00000162772.17  3' UTR  

It does not contain all of my start values from df1, and it gives me duplicates. I am quite new to Python and pandas; I searched a lot but couldn't figure it out.

Thanks a lot in advance for your help!

A solution using a two-step process:

Let's say we have:

import pandas as pd

df = pd.DataFrame({'chr_number': ['chr1', 'chr2'], 'start': [3, 5]})

df2 = pd.DataFrame({'index': ['chr1', 'chr3'], 'col': ['a', 'b'], 'start': [1, 2], 'end':[4, 5]})

print(df)
print(df2)

  chr_number  start
0       chr1      3
1       chr2      5
  index col  start  end
0  chr1   a      1    4
1  chr3   b      2    5

We can then apply aggregation and explode to get the desired output.

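# For each row of df2, collect every df.start value inside [start, end] (both ends inclusive)
df2.start = df2.apply(lambda x: df.loc[(x['start'] <= df.start) & (df.start <= x['end'])].start.tolist(), axis=1)
# explode() then turns each list element into its own row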
print(df2.explode('start'))

  index col start  end
0  chr1   a     3    4
1  chr3   b     3    5
1  chr3   b     5    5

Edit: I realized that I was performing the wrong operation, comparing df2 values instead of df values. The edited code now replaces df2.start, for each row of df2, with all df.start values that fall between that row's df2.start and df2.end.
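
For the column names used in the question, a merge-then-filter approach is another option. The following is a minimal sketch, not the method from the answer above: it assumes a position should only be matched against intervals on the same chromosome, and it keeps the half-open test Start <= start < End from the question's loop. The data is a subset of the question's frames, and the names pairs, inside and result are illustrative.

```python
import pandas as pd

# Subset of the question's data (values copied from the post).
df1 = pd.DataFrame({
    "chr_number": ["chr1", "chr1", "chr1", "chr1"],
    "start": [111478338, 111478370, 157123306, 212619741],
    "end": [111478339, 111478371, 157123307, 212619742],
    "strand": ["+", "+", "-", "+"],
})
df2 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1", "chr1"],
    "Start": [111478330, 157123300, 207079296, 212619736],
    "End": [111478444, 157123338, 207079412, 212619771],
    "Strand": ["+", "-", "+", "+"],
    "Gene": ["C1orf162", "ETV3", "PFKFB2", "ATF3"],
})

# Pair every df1 position with every df2 interval on the same chromosome.
# Assumption: chromosome names must match for a hit to count.
positions = df1[["chr_number", "start"]].rename(
    columns={"chr_number": "Chromosome", "start": "CrossLink"}
)
pairs = positions.merge(df2, on="Chromosome", how="inner")

# Keep only the pairs where Start <= CrossLink < End,
# the same bounds as the loop in the question.
inside = (pairs["Start"] <= pairs["CrossLink"]) & (pairs["CrossLink"] < pairs["End"])
result = pairs[inside].reset_index(drop=True)

print(result)
```

Each start value from df1 that falls inside an interval appears exactly once per matching df2 row, so the duplicates from the loop do not arise. The merge builds every same-chromosome pair before filtering, so on large inputs the intermediate frame can become very big.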


