
How to combine two pandas dataframes with a conditional?

I have two pandas dataframes which I would like to combine according to a rule.

This is the first dataframe

import pandas as pd
df1 = pd.DataFrame()

df1 

rank    begin    end     labels
first   30953   31131    label1
first   31293   31435    label2
first   31436   31733    label4
first   31734   31754    label1
first   32841   33037    label3
second  33048   33456    label4
....

The second dataframe has only two columns, rank and start:

df2

rank    start 
first   31333     
first   31434     
first   33039    
first   33123     
first   33125     

In the first dataframe df1, each row has a begin and an end. I would like to check whether the integer start in df2 falls within one of these ranges.

Here is what the end result should look like:

result

rank    start     labels
first   31333     label2
first   31434     label2
first   33039     NaN
first   33123     label4
first   33125     label4

The value start==31333 falls within the range 31293 to 31435 in df1, whose label is label2. The integer 31434 also falls within the range 31293:31435, so it also gets annotated with label2. The value 33039 does not fall within any interval in df1, so it gets a NaN value.

The rule by which these dataframes are combined is this:

(df2.start >= df1.begin) & (df2.start <= df1.end)

But also, each row must match on the rank value, e.g. both rows must contain the string first or second for this conditional to apply.

Here is the code I was using to combine these two dataframes, but it doesn't scale very well at all:

import numpy as np

def between_range(row):
    subset = df1.loc[(row["rank"] == df1["rank"]) & (row.start >= df1.begin) & (row.start <= df1.end), :]
    if subset.empty:
        return np.nan
    return subset.labels.iloc[0]

Is there another way to do this with merging (maybe on rank)? Any other pandas-based solution?

Try this code block

import numpy as np

def match_labels(row):
    curr_df = df1[ (df1['rank']==row['rank']) & (df1['begin']<=row['start']) & (df1['end']>=row['start']) ]
    try:
        row['labels'] = curr_df['labels'].iloc[0]
    except IndexError:  # no range of the same rank contains this start
        row['labels'] = np.nan

    return row

result = df2.apply(match_labels, axis=1)
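For reference, here is a self-contained sketch of this approach, reconstructing df1 and df2 from the question's tables. Note that the sample's last range (33048 to 33456) is tagged rank second, so under strict rank matching the starts 33123 and 33125 come out as NaN here, not label4 as in the question's expected result:

```python
import numpy as np
import pandas as pd

# Sample frames reconstructed from the question's tables.
df1 = pd.DataFrame({
    "rank":   ["first", "first", "first", "first", "first", "second"],
    "begin":  [30953, 31293, 31436, 31734, 32841, 33048],
    "end":    [31131, 31435, 31733, 31754, 33037, 33456],
    "labels": ["label1", "label2", "label4", "label1", "label3", "label4"],
})
df2 = pd.DataFrame({
    "rank":  ["first"] * 5,
    "start": [31333, 31434, 33039, 33123, 33125],
})

def match_labels(row):
    # Rows of df1 with the same rank whose [begin, end] range contains start.
    curr_df = df1[(df1["rank"] == row["rank"])
                  & (df1["begin"] <= row["start"])
                  & (df1["end"] >= row["start"])]
    try:
        row["labels"] = curr_df["labels"].iloc[0]
    except IndexError:  # no matching range for this rank
        row["labels"] = np.nan
    return row

result = df2.apply(match_labels, axis=1)
print(result)
```

Only 31333 and 31434 fall inside a range whose rank matches, so the other three starts stay NaN.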

Hope this helps

You can do everything quickly with a massive join if you can fit len(df1)*len(df2) rows of data into memory:

df = df2.merge(df1, on='rank', how='left')
# Keep only in-range matches: this gives the correct label of every entry that does belong to a label.
df = df.loc[(df.start >= df.begin) & (df.start <= df.end), ['rank', 'start', 'labels']]
# This effectively adds the entries that do not belong to any label back in, with NaN labels.
result = df2.merge(df, on=['rank', 'start'], how='left')

This solution also takes care of cases when start falls in the range of more than one begin and end pair: in such cases, you will get as many rows as there are matching labels.
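As a sketch, here is the full pipeline run on the question's sample data (reconstructed below). Only 31333 and 31434 fall inside a range of matching rank, so the remaining three starts come back as NaN:

```python
import pandas as pd

# Sample frames reconstructed from the question's tables.
df1 = pd.DataFrame({
    "rank":   ["first", "first", "first", "first", "first", "second"],
    "begin":  [30953, 31293, 31436, 31734, 32841, 33048],
    "end":    [31131, 31435, 31733, 31754, 33037, 33456],
    "labels": ["label1", "label2", "label4", "label1", "label3", "label4"],
})
df2 = pd.DataFrame({
    "rank":  ["first"] * 5,
    "start": [31333, 31434, 33039, 33123, 33125],
})

# Join every df2 row against every df1 row of the same rank.
df = df2.merge(df1, on="rank", how="left")
# Keep only the rows whose start falls inside a [begin, end] range.
df = df.loc[(df.start >= df.begin) & (df.start <= df.end),
            ["rank", "start", "labels"]]
# Merge back onto df2 so unmatched starts reappear with NaN labels.
result = df2.merge(df, on=["rank", "start"], how="left")
print(result)
```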

If you can't fit this into memory, you can try partitioning your data by rank: do the merge for just the rows with rank == 'first', then for rank == 'second', and so on. You can also partition on the numeric ranges of begin, end and start: df = df2[(df2.start > 31000) & (df2.start <= 32000)].merge(df1[(df1.begin > 31000) & (df1.end <= 32000)], how = 'left'), for example.
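The rank-partitioned variant can be sketched as a loop that merges one rank at a time and concatenates the pieces (same reconstructed sample data as in the question; df2 here contains only rank first, so the loop runs once):

```python
import pandas as pd

df1 = pd.DataFrame({
    "rank":   ["first", "first", "first", "first", "first", "second"],
    "begin":  [30953, 31293, 31436, 31734, 32841, 33048],
    "end":    [31131, 31435, 31733, 31754, 33037, 33456],
    "labels": ["label1", "label2", "label4", "label1", "label3", "label4"],
})
df2 = pd.DataFrame({
    "rank":  ["first"] * 5,
    "start": [31333, 31434, 33039, 33123, 33125],
})

parts = []
for rank in df2["rank"].unique():
    d1 = df1[df1["rank"] == rank]  # only this rank's ranges in memory at once
    d2 = df2[df2["rank"] == rank]
    m = d2.merge(d1, on="rank", how="left")
    m = m.loc[(m.start >= m.begin) & (m.start <= m.end),
              ["rank", "start", "labels"]]
    parts.append(d2.merge(m, on=["rank", "start"], how="left"))

result = pd.concat(parts, ignore_index=True)
```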
