How to combine two pandas dataframes with a conditional?
I have two pandas dataframes that I would like to combine according to a rule.
This is the first dataframe:
import pandas as pd
df1 = pd.DataFrame()
df1
rank begin end labels
first 30953 31131 label1
first 31293 31435 label2
first 31436 31733 label4
first 31734 31754 label1
first 32841 33037 label3
second 33048 33456 label4
....
The second dataframe has only two columns, rank and start:
df2
rank start
first 31333
first 31434
first 33039
first 33123
first 33125
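To make the example reproducible, the two frames shown above could be built roughly like this. This is only a sketch: it assumes the numeric columns are integers and includes just the rows printed here, since the elided rows ("....") are unknown.

```python
import pandas as pd

# Rows copied from the tables above; the elided rows are not reproduced.
df1 = pd.DataFrame({
    "rank": ["first", "first", "first", "first", "first", "second"],
    "begin": [30953, 31293, 31436, 31734, 32841, 33048],
    "end": [31131, 31435, 31733, 31754, 33037, 33456],
    "labels": ["label1", "label2", "label4", "label1", "label3", "label4"],
})

df2 = pd.DataFrame({
    "rank": ["first"] * 5,
    "start": [31333, 31434, 33039, 33123, 33125],
})

print(df1)
print(df2)
```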
In the first dataframe df1, the data has a begin and an end. I would like to check whether the integer for start in df2 is within this range.
Here is what the end result should look like:
result
rank start labels
first 31333 label2
first 31434 label2
first 33039 NaN
first 33123 label4
first 33125 label4
The value start == 31333 falls within the range 31293 to 31435 in df1, whose label is label2. The integer 31434 is also within the range 31293:31435, so it also gets annotated with label2. The value 33039 does not fall within any interval in df1, so it gets a NaN value.
The rule by which these dataframes are combined is this:
(df2.start >= df1.begin) & (df2.start <= df1.end)
In addition, each row must match on the same rank value, e.g. the rank string must be first or second in both dataframes for this conditional to apply.
Here is the code I was using to combine these two dataframes, but it doesn't scale well at all:
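import numpy as np

def between_range(row):
    # Use df1["rank"], because df1.rank is the DataFrame.rank method.
    subset = df1.loc[(row["rank"] == df1["rank"]) & (row["start"] >= df1["begin"]) & (row["start"] <= df1["end"]), :]
    if subset.empty:
        return np.nan
    # Return the first matching label rather than a whole Series.
    return subset["labels"].iloc[0]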
Is there another way to do this with merging (maybe on rank)? Any other pandas-based solution?
Try this code block:
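import numpy as np

def match_labels(row):
    # Rows of df1 with the same rank whose [begin, end] interval contains start.
    curr_df = df1[(df1['rank'] == row['rank']) & (df1['begin'] <= row['start']) & (df1['end'] >= row['start'])]
    try:
        row['labels'] = curr_df['labels'].iloc[0]
    except IndexError:
        # No matching interval was found.
        row['labels'] = np.nan
    return row

result = df2.apply(match_labels, axis=1)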
Hope this helps.
You can do everything quickly with a massive join if you can fit len(df1)*len(df2) rows of data into memory:
# Pair every df2 row with every df1 row that has the same rank.
df = df2.merge(df1, how='left', on='rank')
# This gives you the correct label of every entry that does indeed belong to a label.
df = df.loc[(df.start >= df.begin) & (df.start <= df.end), ['rank', 'start', 'labels']]
# This effectively adds the entries that do not belong to any label back into df.
result = df2.merge(df, how='left', on=['rank', 'start'])
This solution also takes care of cases where start falls in the range of more than one begin and end pair: in such cases, you will get as many rows as there are matching labels.
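As an illustration of the multiple-match behaviour, here is a small self-contained sketch of the same merge-and-filter steps. The overlapping interval labelled labelX is a hypothetical row added only for this demonstration and is not part of the original data.

```python
import pandas as pd

df1 = pd.DataFrame({
    "rank": ["first", "first", "first"],
    "begin": [31293, 31400, 32841],  # 31400-31500 is a hypothetical overlapping interval
    "end": [31435, 31500, 33037],
    "labels": ["label2", "labelX", "label3"],
})
df2 = pd.DataFrame({
    "rank": ["first", "first", "first"],
    "start": [31333, 31434, 33039],
})

# Pair every df2 row with every df1 row of the same rank.
pairs = df2.merge(df1, how="left", on="rank")

# Keep only the pairs where start lies inside [begin, end].
inside = pairs.loc[
    (pairs["start"] >= pairs["begin"]) & (pairs["start"] <= pairs["end"]),
    ["rank", "start", "labels"],
]

# Merge back so that unmatched starts reappear with a missing label.
result = df2.merge(inside, how="left", on=["rank", "start"])
print(result)
```

In this sketch 31434 lies inside two intervals, so it appears twice in result, while 33039 matches nothing and keeps a missing label.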
If you can't fit this into memory, you can try partitioning your data by rank: do this for just those with rank == 'first', then rank == 'second', and so on. You can also partition on begin, end and start, for example:
df = df2[(df2.start > 31000) & (df2.start <= 32000)].merge(df1[(df1.begin > 31000) & (df1.end <= 32000)], how = 'left')
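The rank-by-rank partitioning could be sketched as below. The helper name label_by_rank is hypothetical, the sample rows are a subset of the question's data, and the rank of the last df2 row is set to 'second' as an assumption so that it can match the rank 'second' interval. The sketch also assumes df2 is non-empty.

```python
import pandas as pd

def label_by_rank(df1, df2):
    """Run the merge-and-filter step one rank at a time (hypothetical helper)."""
    pieces = []
    for rank_value, part2 in df2.groupby("rank", sort=False):
        # Only the df1 rows with the current rank take part in this join.
        part1 = df1[df1["rank"] == rank_value]
        pairs = part2.merge(part1, how="left", on="rank")
        inside = pairs.loc[
            (pairs["start"] >= pairs["begin"]) & (pairs["start"] <= pairs["end"]),
            ["rank", "start", "labels"],
        ]
        # Restore the starts of this rank that matched no interval.
        pieces.append(part2.merge(inside, how="left", on=["rank", "start"]))
    return pd.concat(pieces, ignore_index=True)

df1 = pd.DataFrame({
    "rank": ["first", "first", "second"],
    "begin": [31293, 32841, 33048],
    "end": [31435, 33037, 33456],
    "labels": ["label2", "label3", "label4"],
})
df2 = pd.DataFrame({
    "rank": ["first", "first", "second"],  # last rank is an assumption for the demo
    "start": [31333, 33039, 33123],
})

result = label_by_rank(df1, df2)
print(result)
```

Each iteration only materialises the pairs for a single rank, so the peak memory use is bounded by the largest rank group rather than by the full join.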