How to combine two pandas dataframes with a conditional?

I have two pandas dataframes that I would like to combine according to a rule.

This is the first dataframe:

import pandas as pd
df1 = pd.DataFrame()

df1 

rank    begin    end     labels
first   30953   31131    label1
first   31293   31435    label2
first   31436   31733    label4
first   31734   31754    label1
first   32841   33037    label3
second  33048   33456    label4
....

The second dataframe has only two columns, rank and start:

df2

rank    start 
first   31333     
first   31434     
first   33039    
first   33123     
first   33125     

In the first dataframe, df1, each row has a begin and an end. I would like to check whether the integer start in df2 falls within this range.

Here is what the end result should look like:

result

rank    start     labels
first   31333     label2
first   31434     label2
first   33039     NaN
first   33123     label4
first   33125     label4

The value start == 31333 lies in the range 31293 to 31435 in df1, which has the label label2. The integer 31434 is also within the range 31293:31435, so it is also annotated with label2. The value 33039 does not fall within any interval in df1, so it gets a NaN value.

The rule by which these dataframes are combined is this:

(df2.start >= df1.begin) & (df2.start <= df1.end)

In addition, each match must share the same rank value, e.g. both rows must have the string first, or both must have second, for this conditional to apply.
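To make the rule concrete, here is a minimal sketch that rebuilds the visible rows of df1 and evaluates the condition for a single (rank, start) pair. The helper name label_for is hypothetical and not part of the original post, and the rows elided by "...." are left out. With only the visible rows, 33123 and 33125 (rank first) fall inside an interval whose rank is second, so they would match only if one of the elided rows covers them.

```python
import numpy as np
import pandas as pd

# Visible sample rows copied from the question (elided rows are not reconstructed).
df1 = pd.DataFrame({
    "rank": ["first", "first", "first", "first", "first", "second"],
    "begin": [30953, 31293, 31436, 31734, 32841, 33048],
    "end": [31131, 31435, 31733, 31754, 33037, 33456],
    "labels": ["label1", "label2", "label4", "label1", "label3", "label4"],
})

def label_for(rank, start):
    """Hypothetical helper: first label whose same-rank interval contains start."""
    mask = (df1["rank"] == rank) & (df1["begin"] <= start) & (start <= df1["end"])
    matches = df1.loc[mask, "labels"]
    return matches.iloc[0] if not matches.empty else np.nan

print(label_for("first", 31333))  # inside 31293..31435
print(label_for("first", 33039))  # inside no interval with rank "first"
```

Note that df1["rank"] is used rather than df1.rank, because attribute access would return the DataFrame.rank method instead of the column.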

Here is the code I was using to combine these two dataframes, but it doesn't scale well at all:

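import numpy as np

def between_range(row):
    # Rows of df1 with the same rank whose [begin, end] interval contains row's start.
    # df1["rank"] is used because df1.rank would refer to the DataFrame.rank method.
    subset = df1.loc[(df1["rank"] == row["rank"]) & (row["start"] >= df1["begin"]) & (row["start"] <= df1["end"]), :]
    if subset.empty:
        return np.nan
    # Return the first matching label rather than a whole Series.
    return subset["labels"].iloc[0]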

Is there another way to do this with merging (maybe on rank)? Any other pandas-based solution?

Try this code block:

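import numpy as np

def match_labels(row):
    # Intervals in df1 with the same rank that contain this row's start.
    curr_df = df1[(df1['rank'] == row['rank']) & (df1['begin'] <= row['start']) & (df1['end'] >= row['start'])]
    try:
        row['labels'] = curr_df['labels'].iloc[0]
    except IndexError:
        # No interval matched; np.nan replaces np.NaN, which NumPy 2.0 removed.
        row['labels'] = np.nan

    return row

result = df2.apply(match_labels, axis=1)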

Hope this helps.

You can do everything quickly with one massive join if you can fit len(df1)*len(df2) rows of data into memory:

# Pair each df2 row with every df1 row that shares its rank.
df = df2.merge(df1, on='rank', how='left')
# This gives you the correct label of every entry that does indeed belong to a label.
df = df.loc[(df.start >= df.begin) & (df.start <= df.end), ['rank', 'start', 'labels']]
# This effectively adds the entries that do not belong to any label back in, with NaN labels.
result = df2.merge(df, on=['rank', 'start'], how='left')

This solution also takes care of cases where start falls in the range of more than one begin and end pair: in such cases, you will get as many rows as there are matching labels.
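The multi-match behaviour can be seen in a small sketch. The data below is hypothetical (it is not the data from the question) and was chosen so that two intervals overlap; the variable names pairs and hits are likewise illustrative.

```python
import pandas as pd

# Hypothetical data: intervals "a" and "b" overlap between 150 and 200.
df1 = pd.DataFrame({
    "rank": ["first", "first", "first"],
    "begin": [100, 150, 300],
    "end": [200, 250, 400],
    "labels": ["a", "b", "c"],
})
df2 = pd.DataFrame({"rank": ["first", "first", "first"], "start": [120, 160, 500]})

# Pair every df2 row with every df1 row of the same rank.
pairs = df2.merge(df1, on="rank", how="left")
# Keep only the pairs where start lies inside [begin, end].
inside = (pairs["start"] >= pairs["begin"]) & (pairs["start"] <= pairs["end"])
hits = pairs.loc[inside, ["rank", "start", "labels"]]
# Merge back so that starts without a match reappear with a NaN label.
result = df2.merge(hits, on=["rank", "start"], how="left")
print(result)
```

Here start 160 appears twice in result (once per matching label), while start 500 appears once with a NaN label.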

If you can't fit this into memory, you can try partitioning your data by rank: do this for just the rows with rank == 'first', then rank == 'second', and so on. You can also partition by begin, end and start, for example: df = df2[(df2.start > 31000) & (df2.start <= 32000)].merge(df1[(df1.begin > 31000) & (df1.end <= 32000)], how = 'left').
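As a rough sketch of the partition-by-rank idea, the loop below runs the same merge-and-filter one rank at a time, so only one rank's pairs are held in memory at once. The helper name label_by_rank and the sample data are hypothetical, and the sketch assumes df2 is non-empty (pd.concat raises on an empty list).

```python
import pandas as pd

def label_by_rank(df1, df2):
    """Hypothetical helper: merge and filter separately for each rank value."""
    pieces = []
    for rank_value, chunk in df2.groupby("rank", sort=False):
        intervals = df1[df1["rank"] == rank_value]
        # Pair this rank's starts with this rank's intervals only.
        pairs = chunk.merge(intervals, on="rank", how="left")
        inside = (pairs["start"] >= pairs["begin"]) & (pairs["start"] <= pairs["end"])
        hits = pairs.loc[inside, ["rank", "start", "labels"]]
        # Re-attach unmatched starts with NaN labels.
        pieces.append(chunk.merge(hits, on=["rank", "start"], how="left"))
    return pd.concat(pieces, ignore_index=True)

# Hypothetical usage data with two ranks.
df1 = pd.DataFrame({"rank": ["first", "second"], "begin": [100, 100],
                    "end": [200, 200], "labels": ["a", "b"]})
df2 = pd.DataFrame({"rank": ["first", "second", "second"], "start": [150, 150, 500]})
out = label_by_rank(df1, df2)
print(out)
```

The rows come back grouped by rank rather than in the original df2 order, which matters only if df2 interleaves ranks.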
