[英]Pandas Filtering one dataframe if a value is between two values from another data frame
I have two data frames as follows:我有两个数据框如下:
df1 df1
chr_number start end strand
0 chr1 111478338 111478339 +
1 chr1 111478370 111478371 +
2 chr1 111478372 111478373 +
3 chr1 157123306 157123307 -
4 chr1 157123307 157123308 -
5 chr1 212619741 212619742 +
6 chr1 212619742 212619743 +
df2 df2
Chromosome Start End Log2 Fold Change Strand Gene \
0 chr1 111478330 111478444 3.036912 + C1orf162
1 chr1 157123300 157123338 3.293174 - ETV3
2 chr1 207079296 207079412 3.916122 + PFKFB2
3 chr1 212619736 212619771 3.880546 + ATF3
Ensembl ID Feature
0 ENSG00000143110.11 3' UTR
1 ENSG00000117036.12 3' UTR
2 ENSG00000123836.15 3' UTR
3 ENSG00000162772.17 3' UTR
I need to look if start from df1 is located between Start and End in df2.我需要查看从 df1 开始是否位于 df2 中的开始和结束之间。 If so, I'd like to have a new data frame which contains start value from df1 with corresponding row in df2.
如果是这样,我想要一个新的数据框,其中包含 df1 的起始值和 df2 中的相应行。
Here is the example of what I need for each start value from df1:以下是我需要 df1 中每个起始值的示例:
CrossLink Chromosome Start End Log2 Fold Change Strand \
1 111478338 chr1 111478330.0 111478444.0 3.036912 +
Gene Ensembl ID Feature
1 C1orf162 ENSG00000143110.11 3' UTR
I wrote this code:我写了这段代码:
df3 = pd.DataFrame([])
df3["CrossLink"] = np.nan
for v in df1["start"]:
df4 = df2[(df2["Start"] <= v) & (df2["End"] > v)]
df3 = df3.append(df4)
df3["CrossLink"] = df1["start"]
And I get this output:我得到这个 output:
CrossLink Chromosome Start End Log2 Fold Change Strand \
0 111478338 chr1 111478330.0 111478444.0 3.036912 +
0 111478338 chr1 111478330.0 111478444.0 3.036912 +
0 111478338 chr1 111478330.0 111478444.0 3.036912 +
1 111478370 chr1 157123300.0 157123338.0 3.293174 -
1 111478370 chr1 157123300.0 157123338.0 3.293174 -
3 157123306 chr1 212619736.0 212619771.0 3.880546 +
3 157123306 chr1 212619736.0 212619771.0 3.880546 +
Gene Ensembl ID Feature
0 C1orf162 ENSG00000143110.11 3' UTR
0 C1orf162 ENSG00000143110.11 3' UTR
0 C1orf162 ENSG00000143110.11 3' UTR
1 ETV3 ENSG00000117036.12 3' UTR
1 ETV3 ENSG00000117036.12 3' UTR
3 ATF3 ENSG00000162772.17 3' UTR
3 ATF3 ENSG00000162772.17 3' UTR
It does not contain all my start values from df1 and it gives me duplicates.它不包含我从 df1 开始的所有值,它给了我重复项。 I am quite new in python and pandas and I searched a lot but I couldn't figure it out.
我在 python 和 pandas 很新,我搜索了很多但我无法弄清楚。
Thanks a lot in advance for your help!非常感谢您的帮助!
A solution using a two step process:使用两步过程的解决方案:
Let's say we have假设我们有
df = pd.DataFrame({'chr_number':['chr1', 'chr2'], 'start':[3, 5],})
df2 = pd.DataFrame({'index': ['chr1', 'chr3'], 'col': ['a', 'b'], 'start': [1, 2], 'end':[4, 5]})
print(df)
print(df2)
chr_number start
0 chr1 3
1 chr2 5
index col start end
0 chr1 a 1 4
1 chr3 b 2 5
We can then apply aggregation and explode to get the desired output.然后我们可以应用聚合和分解以获得所需的 output。
df2.start = df2.apply(lambda x: df.loc[(x['start'] <= df.start) & (df.start <= x['end'])].start.agg(list), axis=1)
print(df2.explode('start'))
index col start end
0 chr1 a 3 4
1 chr3 b 3 5
1 chr3 b 5 5
Edit: I realized that I was doing the incorrect operation comparing df2
values instead of df
.编辑:我意识到我在比较
df2
值而不是df
时做的操作不正确。 The edited code now replaces df2.start
with all df.start
values that fall between df2.start
and df2.end
for rows of df2
.编辑后的代码现在将
df2.start
替换为df2
行的df2.start
和df2.end
之间的所有df.start
值。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.