How to compare two dataframes in pandas without a for loop?
I want to compare two dataframes and find pairs of rows with the same sample, chr and family, where the value of pos in the just_r dataframe lies between the just_f pos and the just_f pos + 1000. My solution is based on two nested loops with itertuples, which is not efficient (my data has thousands of rows and it takes a very long time). Could someone help me find a smarter solution? Below are part of my input data, the expected output, and my code. Thanks a lot!
just_f
sample chr pos strand family order support comment frequency
2 NC_025812.2 9831 . Tourist|7 Tourist F - 0,562
2 NC_025812.2 12038 . Tourist|7 Tourist F - 1,000
5 NC_025812.2 12040 . Tourist|7 Tourist F - 1,000
12 NC_025812.2 12042 . Tourist|7 Tourist F - 1,000
11 NC_025812.2 30758 . uc|32 uc F - 0,547
12 NC_025812.2 49544 . uc|10 uc F - 0,112
11 NC_025812.2 56184 . hAT|9 hAT F - 0,997
5 NC_025812.2 56246 . hAT|9 hAT F - 0,756
3 NC_025812.2 56265 . hAT|9 hAT F - 1,000
12 NC_025812.2 56268 . hAT|9 hAT F - 1,000
just_r
sample chr pos strand family order support comment frequency
5 NC_025812.2 12396 . Tourist|7 Tourist R - 0,975
2 NC_025812.2 12433 . Tourist|7 Tourist R - 0,935
12 NC_025812.2 12478 . Tourist|7 Tourist R - 0,887
12 NC_025812.2 28943 . Tourist|7 Tourist R - 0,610
5 NC_025812.2 28947 . Tourist|7 Tourist R - 0,490
2 NC_025812.2 51483 . Mutator|24 Mutator R - 0,422
5 NC_025812.2 56713 . hAT|9 hAT R - 0,925
11 NC_025812.2 56737 . hAT|9 hAT R - 1,000
3 NC_025812.2 56778 . hAT|9 hAT R - 0,891
12 NC_025812.2 56800 . hAT|9 hAT R - 0,965
f_r_pairs
sample chr pos strand family order support comment frequency
2 NC_025812.2 12038 . Tourist|7 Tourist F - 1.0
2 NC_025812.2 12433 . Tourist|7 Tourist R - 0.935
5 NC_025812.2 12040 . Tourist|7 Tourist F - 1.0
5 NC_025812.2 12396 . Tourist|7 Tourist R - 0.975
12 NC_025812.2 12042 . Tourist|7 Tourist F - 1.0
12 NC_025812.2 12478 . Tourist|7 Tourist R - 0.887
11 NC_025812.2 56184 . hAT|9 hAT F - 0.997
11 NC_025812.2 56737 . hAT|9 hAT R - 1.0
5 NC_025812.2 56246 . hAT|9 hAT F - 0.756
5 NC_025812.2 56713 . hAT|9 hAT R - 0.925
3 NC_025812.2 56265 . hAT|9 hAT F - 1.0
3 NC_025812.2 56778 . hAT|9 hAT R - 0.891
12 NC_025812.2 56268 . hAT|9 hAT F - 1.0
12 NC_025812.2 56800 . hAT|9 hAT R - 0.965
import pandas as pd
df_raw = pd.read_csv('1-DH-to-12-RO.NC_teinsertions.txt', sep="\t", decimal=',')
df_sort = df_raw.sort_values(by=['chr', 'pos', 'sample'])
just_f = df_sort[(df_sort["support"] == 'F')]
just_r = df_sort[(df_sort["support"] == 'R')]
f_r_pairs = pd.DataFrame(columns=just_f.columns)
# choosing rows for reference TE insertions (having pairs with F and R in range 1000 bp)
for f in just_f.itertuples():
    for r in just_r.itertuples():
        if f.sample == r.sample and f.chr == r.chr and f.family == r.family and r.pos in range(f.pos, f.pos + 1000):
            f_r_pairs = f_r_pairs.append(pd.DataFrame([f]))
            f_r_pairs = f_r_pairs.append(pd.DataFrame([r]))
You can join the two dataframes on the matching keys, then filter for the rows that satisfy the pos condition.
There are two functions you can use: join and merge. merge is the more flexible one:
f_r_pairs = (
    just_f.merge(just_r, on=["sample", "chr", "family"], suffixes=("_f", "_r"))
    .query("pos_f <= pos_r <= pos_f + 1000")
)
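The merge above returns one wide row per F/R pair (pos_f, pos_r, and so on), not the interleaved layout shown in the expected output. As a minimal sketch of one way to get from one to the other, the example below uses a small made-up subset of the question's data with only the columns needed; the helper names wide, f_rows, r_rows and the pair column are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Small illustrative subset of the question's data (not the full file).
just_f = pd.DataFrame({
    "sample": [2, 2, 5, 12],
    "chr": ["NC_025812.2"] * 4,
    "pos": [9831, 12038, 12040, 49544],
    "family": ["Tourist|7", "Tourist|7", "Tourist|7", "uc|10"],
})
just_r = pd.DataFrame({
    "sample": [5, 2, 12],
    "chr": ["NC_025812.2"] * 3,
    "pos": [12396, 12433, 12478],
    "family": ["Tourist|7"] * 3,
})

keys = ["sample", "chr", "family"]

# Inner join on the keys, then keep pairs with pos_f <= pos_r <= pos_f + 1000.
wide = (
    just_f.merge(just_r, on=keys, suffixes=("_f", "_r"))
    .query("pos_f <= pos_r <= pos_f + 1000")
    .reset_index(drop=True)
)

# Split each pair back into an F row and an R row, tagged with a pair number.
f_rows = (
    wide[keys + ["pos_f"]]
    .rename(columns={"pos_f": "pos"})
    .assign(support="F", pair=wide.index)
)
r_rows = (
    wide[keys + ["pos_r"]]
    .rename(columns={"pos_r": "pos"})
    .assign(support="R", pair=wide.index)
)

# Interleave: each F row is followed by its matching R row.
f_r_pairs = (
    pd.concat([f_rows, r_rows])
    .sort_values(["pair", "support"])
    .drop(columns="pair")
    .reset_index(drop=True)
)
print(f_r_pairs)
```

Note that range(f.pos, f.pos + 1000) in the question excludes the upper bound, while the query here includes it; write pos_r < pos_f + 1000 in the query if the exclusive bound is what you want. Also keep in mind that the merge builds every F/R combination within a (sample, chr, family) group before filtering, so very large groups can use a lot of memory.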