I have two pandas dataframes that I would like to combine according to a rule.
This is the first dataframe
import pandas as pd
df1 = pd.DataFrame()
df1
rank begin end labels
first 30953 31131 label1
first 31293 31435 label2
first 31436 31733 label4
first 31734 31754 label1
first 32841 33037 label3
second 33048 33456 label4
....
The second dataframe has only two columns, rank and start:
df2
rank start
first 31333
first 31434
first 33039
first 33123
first 33125
In the first dataframe, df1, each row has a begin and an end. I would like to check whether the integer start in df2 falls within this range.
Here is what the end result should look like:
result
rank start labels
first 31333 label2
first 31434 label2
first 33039 NaN
first 33123 label4
first 33125 label4
The value start == 31333 lies within the range 31293 to 31435 in df1, whose label is label2. The integer 31434 also lies within the range 31293:31435, so it is annotated with label2 as well. The value 33039 does not fall within any interval in df1, so it gets a NaN value.
The rule by which these dataframes are combined is this:
(df2.start >= df1.begin) & (df2.start <= df1.end)
In addition, each pair of rows must have the same rank value, e.g. both rows must have the string first, or both must have second, for this condition to apply.
Here is the code I was using to combine these two dataframes, but it doesn't scale very well at all:
import numpy as np

def between_range(row):
    # Rows of df1 with the same rank whose [begin, end] interval contains row.start
    subset = df1.loc[(row["rank"] == df1["rank"]) & (row.start >= df1.begin) & (row.start <= df1.end), :]
    if subset.empty:
        return np.nan
    return subset.labels
Is there another way to do this with merging (maybe on rank)? Any other pandas-based solution?
Try this code block
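import numpy as np

def match_labels(row):
    # df1 rows with the same rank whose [begin, end] interval contains row['start']
    curr_df = df1[(df1['rank'] == row['rank']) & (df1['begin'] <= row['start']) & (df1['end'] >= row['start'])]
    try:
        # Use the first matching label
        row['labels'] = curr_df['labels'].iloc[0]
    except IndexError:
        # No interval matched this row
        row['labels'] = np.nan
    return row

result = df2.apply(match_labels, axis=1)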
Hope this helps
You can do everything quickly with a massive join if you can fit len(df1)*len(df2) rows of data into memory:
# Pair each df2 row with every df1 row that shares its rank.
df = df2.merge(df1, on='rank', how='left')
# Keep the pairs where start lies in [begin, end]: this gives the correct label for every entry that does belong to a label.
df = df.loc[(df.start >= df.begin) & (df.start <= df.end), ['rank', 'start', 'labels']]
# Add the entries that do not belong to any label back in, with NaN labels.
result = df2.merge(df, on=['rank', 'start'], how='left')
This solution also takes care of cases when start falls in the range of more than one begin and end pair: in such cases, you will get as many rows as there are matching labels.
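As a minimal sketch of the merge-then-filter approach above, the following rebuilds the two frames from the rows visible in the question (the rows elided by "...." are left out) and runs the three steps. The intermediate names pairs and hits and the explicit on= arguments are my own choices for readability, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the visible rows of the question's tables.
df1 = pd.DataFrame({
    "rank": ["first", "first", "first", "first", "first", "second"],
    "begin": [30953, 31293, 31436, 31734, 32841, 33048],
    "end": [31131, 31435, 31733, 31754, 33037, 33456],
    "labels": ["label1", "label2", "label4", "label1", "label3", "label4"],
})
df2 = pd.DataFrame({
    "rank": ["first"] * 5,
    "start": [31333, 31434, 33039, 33123, 33125],
})

# Step 1: pair every df2 row with every df1 row of the same rank.
pairs = df2.merge(df1, on="rank", how="left")

# Step 2: keep only the pairs where start falls inside [begin, end].
inside = (pairs["start"] >= pairs["begin"]) & (pairs["start"] <= pairs["end"])
hits = pairs.loc[inside, ["rank", "start", "labels"]]

# Step 3: left-join back onto df2 so unmatched rows reappear with NaN labels.
result = df2.merge(hits, on=["rank", "start"], how="left")
print(result)
```

With the rows exactly as shown in the question, 33123 and 33125 have rank first, while the only interval containing them (33048 to 33456) has rank second. Under the same-rank rule they therefore come out as NaN here, not label4 as in the question's expected table.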
If you can't fit this into memory, you can try partitioning your data by rank: do this for just those rows with rank == 'first', then rank == 'second', and so on. You can also partition by ranges of begin, end and start: df = df2[(df2.start > 31000) & (df2.start <= 32000)].merge(df1[(df1.begin > 31000) & (df1.end <= 32000)], how = 'left'), for example.
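The rank-by-rank partitioning might be sketched as follows. label_one_rank and label_by_rank are hypothetical helper names, not pandas functions, and the last row of df2 is an invented extra row (not in the question) added only so that a second partition is exercised.

```python
import pandas as pd

def label_one_rank(left, right):
    """Merge-then-filter for two frames that share a single rank value."""
    pairs = left.merge(right, on="rank", how="left")
    inside = (pairs["start"] >= pairs["begin"]) & (pairs["start"] <= pairs["end"])
    hits = pairs.loc[inside, ["rank", "start", "labels"]]
    # Left-join back so starts without a matching interval keep a NaN label.
    return left.merge(hits, on=["rank", "start"], how="left")

def label_by_rank(df1, df2):
    """Handle one rank at a time so only that rank's pairs are in memory."""
    pieces = []
    for rank_value, left in df2.groupby("rank", sort=False):
        right = df1[df1["rank"] == rank_value]
        pieces.append(label_one_rank(left, right))
    return pd.concat(pieces, ignore_index=True)

# Visible rows of df1 from the question.
df1 = pd.DataFrame({
    "rank": ["first", "first", "first", "first", "first", "second"],
    "begin": [30953, 31293, 31436, 31734, 32841, 33048],
    "end": [31131, 31435, 31733, 31754, 33037, 33456],
    "labels": ["label1", "label2", "label4", "label1", "label3", "label4"],
})
# df2 from the question, plus one hypothetical "second" row at the end.
df2 = pd.DataFrame({
    "rank": ["first"] * 5 + ["second"],
    "start": [31333, 31434, 33039, 33123, 33125, 33123],
})

result = label_by_rank(df1, df2)
print(result)
```

The output is grouped by rank in order of first appearance, so it does not necessarily preserve the original row order of df2. If that order matters, one option is to carry the original index through as a column and sort on it afterwards.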