How to compare two dataframes in pandas without for loop?

I want to compare two dataframes and find pairs of rows that have the same sample, chr and family, where the value of pos in the just_r dataframe lies between the just_f pos and just_f pos + 1000. My solution uses two nested loops over itertuples, which is inefficient: my data has thousands of rows and it takes a very long time. Could someone help me find a smarter solution? Part of my input data, the expected output and my code are below. Thanks a lot!

just_f

sample  chr pos strand  family  order   support comment frequency
2   NC_025812.2 9831    .   Tourist|7   Tourist F   -   0,562
2   NC_025812.2 12038   .   Tourist|7   Tourist F   -   1,000
5   NC_025812.2 12040   .   Tourist|7   Tourist F   -   1,000
12  NC_025812.2 12042   .   Tourist|7   Tourist F   -   1,000
11  NC_025812.2 30758   .   uc|32   uc  F   -   0,547
12  NC_025812.2 49544   .   uc|10   uc  F   -   0,112
11  NC_025812.2 56184   .   hAT|9   hAT F   -   0,997
5   NC_025812.2 56246   .   hAT|9   hAT F   -   0,756
3   NC_025812.2 56265   .   hAT|9   hAT F   -   1,000
12  NC_025812.2 56268   .   hAT|9   hAT F   -   1,000

just_r

5   NC_025812.2 12396   .   Tourist|7   Tourist R   -   0,975
2   NC_025812.2 12433   .   Tourist|7   Tourist R   -   0,935
12  NC_025812.2 12478   .   Tourist|7   Tourist R   -   0,887
12  NC_025812.2 28943   .   Tourist|7   Tourist R   -   0,610
5   NC_025812.2 28947   .   Tourist|7   Tourist R   -   0,490
2   NC_025812.2 51483   .   Mutator|24  Mutator R   -   0,422
5   NC_025812.2 56713   .   hAT|9   hAT R   -   0,925
11  NC_025812.2 56737   .   hAT|9   hAT R   -   1,000
3   NC_025812.2 56778   .   hAT|9   hAT R   -   0,891
12  NC_025812.2 56800   .   hAT|9   hAT R   -   0,965

f_r_pairs

sample  chr pos strand  family  order   support comment frequency
2   NC_025812.2 12038   .   Tourist|7   Tourist F   -   1.0
2   NC_025812.2 12433   .   Tourist|7   Tourist R   -   0.935
5   NC_025812.2 12040   .   Tourist|7   Tourist F   -   1.0
5   NC_025812.2 12396   .   Tourist|7   Tourist R   -   0.975
12  NC_025812.2 12042   .   Tourist|7   Tourist F   -   1.0
12  NC_025812.2 12478   .   Tourist|7   Tourist R   -   0.887
11  NC_025812.2 56184   .   hAT|9   hAT F   -   0.997
11  NC_025812.2 56737   .   hAT|9   hAT R   -   1.0
5   NC_025812.2 56246   .   hAT|9   hAT F   -   0.756
5   NC_025812.2 56713   .   hAT|9   hAT R   -   0.925
3   NC_025812.2 56265   .   hAT|9   hAT F   -   1.0
3   NC_025812.2 56778   .   hAT|9   hAT R   -   0.891
12  NC_025812.2 56268   .   hAT|9   hAT F   -   1.0
12  NC_025812.2 56800   .   hAT|9   hAT R   -   0.965

import pandas as pd

df_raw = pd.read_csv('1-DH-to-12-RO.NC_teinsertions.txt', sep="\t", decimal=',')
df_sort = df_raw.sort_values(by=['chr', 'pos', 'sample'])

just_f = df_sort[(df_sort["support"] == 'F')]
just_r = df_sort[(df_sort["support"] == 'R')]

f_r_pairs = pd.DataFrame(columns=just_f.columns)

# choosing rows for reference TE insertions (having pairs with F and R in range 1000 bp)
for f in just_f.itertuples():
    for r in just_r.itertuples():
        if f.sample == r.sample and f.chr == r.chr and f.family == r.family and r.pos in range(f.pos, f.pos + 1000):
            f_r_pairs = f_r_pairs.append(pd.DataFrame([f]))
            f_r_pairs = f_r_pairs.append(pd.DataFrame([r]))

You can join the two dataframes on the matching keys, then filter for the rows that satisfy the pos condition.

There are two functions you can use for this: join and merge. merge is the more flexible of the two:

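# Pair every F row with every R row that shares sample, chr and family,
# then keep only the pairs whose R position is at most 1000 bp after the F position.
f_r_pairs = (
    just_f.merge(just_r, on=["sample", "chr", "family"], suffixes=("_f", "_r"))
    .query("pos_f <= pos_r <= pos_f + 1000")
)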
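
The merge returns one wide row per matching pair (pos_f, pos_r, and so on), whereas the expected output in the question stacks each F row directly above its R row. Below is a minimal sketch of the merge followed by one possible way to restack the result. The trimmed column set, the helper name side and the temporary pair column are assumptions made for illustration; they do not come from the question or the answer.

```python
import pandas as pd

# A few rows from the question, with a reduced set of columns (assumption).
cols = ["sample", "chr", "pos", "family", "support", "frequency"]
just_f = pd.DataFrame(
    [
        (2, "NC_025812.2", 9831, "Tourist|7", "F", 0.562),
        (2, "NC_025812.2", 12038, "Tourist|7", "F", 1.000),
        (5, "NC_025812.2", 12040, "Tourist|7", "F", 1.000),
        (11, "NC_025812.2", 30758, "uc|32", "F", 0.547),
    ],
    columns=cols,
)
just_r = pd.DataFrame(
    [
        (5, "NC_025812.2", 12396, "Tourist|7", "R", 0.975),
        (2, "NC_025812.2", 12433, "Tourist|7", "R", 0.935),
        (5, "NC_025812.2", 28947, "Tourist|7", "R", 0.490),
        (2, "NC_025812.2", 51483, "Mutator|24", "R", 0.422),
    ],
    columns=cols,
)

keys = ["sample", "chr", "family"]

# Wide result: one row per (F, R) pair that passes the position filter.
wide = just_f.merge(just_r, on=keys, suffixes=("_f", "_r")).query(
    "pos_f <= pos_r <= pos_f + 1000"
)


def side(df, suffix):
    """Hypothetical helper: keep the key columns plus one side's columns, without the suffix."""
    part = df[keys + [c for c in df.columns if c.endswith(suffix)]]
    return part.rename(
        columns=lambda c: c[: -len(suffix)] if c.endswith(suffix) else c
    )


# Number the pairs so that each F row can be placed directly above its R row.
pair_id = range(len(wide))
f_r_pairs = (
    pd.concat([side(wide, "_f").assign(pair=pair_id),
               side(wide, "_r").assign(pair=pair_id)])
    .sort_values(["pair", "support"])  # "F" sorts before "R" within a pair
    .drop(columns="pair")
    .reset_index(drop=True)[cols]
)
print(f_r_pairs)
```

Two details differ slightly from the loop in the question. The loop uses range(f.pos, f.pos + 1000), which excludes pos + 1000 itself, so writing the filter as "pos_f <= pos_r < pos_f + 1000" would reproduce it exactly. Also, an F row that matches several R rows appears once per match here, just as it does in the loop.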
