Fastest way to merge pandas dataframe on ranges
I have a dataframe A:
ip_address
0 13
1 5
2 20
3 11
.. ........
and another dataframe B:
lowerbound_ip_address upperbound_ip_address country
0 0 10 Australia
1 11 20 China
Based on this, I need to add a column to A such that:
ip_address country
13 China
5 Australia
My idea is to define a function and then call map on each row of A. But how would I search through each row of B for this? Is there a better way to do this?
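For reference, the row-by-row idea described in the question could look roughly like the sketch below. The function name `lookup_country` and the variable names `dfa` and `dfb` are assumptions made for illustration. It rescans all of B for every row of A, so it serves only as a baseline to compare the answers against.

```python
import pandas as pd

# Sample data copied from the question.
dfa = pd.DataFrame({"ip_address": [13, 5, 20, 11]})
dfb = pd.DataFrame({
    "lowerbound_ip_address": [0, 11],
    "upperbound_ip_address": [10, 20],
    "country": ["Australia", "China"],
})

def lookup_country(ip):
    """Hypothetical helper: return the country whose inclusive range contains ip."""
    # Boolean mask over the rows of B whose range contains ip.
    hit = (dfb["lowerbound_ip_address"] <= ip) & (ip <= dfb["upperbound_ip_address"])
    matches = dfb.loc[hit, "country"]
    # First match, or None when no range contains ip.
    return matches.iloc[0] if not matches.empty else None

# One full scan of B per row of A: cost grows with len(A) * len(B).
naive = dfa.assign(country=dfa["ip_address"].map(lookup_country))
print(naive)
```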
Use pd.IntervalIndex
In [2503]: s = pd.IntervalIndex.from_arrays(dfb.lowerbound_ip_address,
dfb.upperbound_ip_address, 'both')
In [2504]: dfa.assign(country=dfb.set_index(s).loc[dfa.ip_address].country.values)
Out[2504]:
ip_address country
0 13 China
1 5 Australia
2 20 China
3 11 China
Details
In [2505]: s
Out[2505]:
IntervalIndex([[0, 10], [11, 20]]
closed='both',
dtype='interval[int64]')
In [2507]: dfb.set_index(s)
Out[2507]:
lowerbound_ip_address upperbound_ip_address country
[0, 10] 0 10 Australia
[11, 20] 11 20 China
In [2506]: dfb.set_index(s).loc[dfa.ip_address]
Out[2506]:
lowerbound_ip_address upperbound_ip_address country
[11, 20] 11 20 China
[0, 10] 0 10 Australia
[11, 20] 11 20 China
[11, 20] 11 20 China
Setup
In [2508]: dfa
Out[2508]:
ip_address
0 13
1 5
2 20
3 11
In [2509]: dfb
Out[2509]:
lowerbound_ip_address upperbound_ip_address country
0 0 10 Australia
1 11 20 China
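The IntervalIndex lookup above can be reproduced end to end with the sketch below. It differs in one respect, which is an assumption about what you may need: it uses `IntervalIndex.get_indexer` instead of `.loc`, so addresses that fall in no range get a missing value rather than raising an error. The extra address 25 is added only to show that case.

```python
import pandas as pd

# Sample data from the question, plus 25, which lies outside every range.
dfa = pd.DataFrame({"ip_address": [13, 5, 20, 11, 25]})
dfb = pd.DataFrame({
    "lowerbound_ip_address": [0, 11],
    "upperbound_ip_address": [10, 20],
    "country": ["Australia", "China"],
})

# Inclusive intervals [lower, upper], one per row of B.
intervals = pd.IntervalIndex.from_arrays(
    dfb["lowerbound_ip_address"], dfb["upperbound_ip_address"], closed="both"
)

# Position of the containing interval for each address, or -1 if there is none.
pos = intervals.get_indexer(dfa["ip_address"])

# Take the country at each position, then blank out the -1 (no match) entries.
country = pd.Series(dfb["country"].to_numpy()[pos], index=dfa.index).where(pos >= 0)
result = dfa.assign(country=country)
print(result)
```

Note that `get_indexer` expects the intervals not to overlap; with overlapping ranges a single address could match several rows, and a different approach would be needed.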
Try pd.merge_asof
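# Here df is dataframe A and df1 is dataframe B; merge_asof requires both to be sorted on the key.
df = df.sort_values('ip_address')
df['lowerbound_ip_address'] = df['ip_address']
pd.merge_asof(df1, df, on='lowerbound_ip_address', direction='forward', allow_exact_matches=False)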
Out[811]:
lowerbound_ip_address upperbound_ip_address country ip_address
0 0 10 Australia 5
1 11 20 China 13
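The call above uses B as the left frame, so it returns one row per range rather than one row per address. A variant that labels every row of A might look like the sketch below. This is not the answer's own code: it puts A on the left, relies on the default `direction='backward'` to find the last lower bound at or below each address, and then checks the upper bound separately. The names `dfa`, `dfb`, and the extra address 25 are assumptions for illustration.

```python
import pandas as pd

# Sample data from the question, plus 25, which lies outside every range.
dfa = pd.DataFrame({"ip_address": [13, 5, 20, 11, 25]})
dfb = pd.DataFrame({
    "lowerbound_ip_address": [0, 11],
    "upperbound_ip_address": [10, 20],
    "country": ["Australia", "China"],
})

# merge_asof requires both frames to be sorted on their keys.
# reset_index() keeps the original row order in an "index" column.
left = dfa.reset_index().sort_values("ip_address")
right = dfb.sort_values("lowerbound_ip_address")

# Backward search: for each address, the last range whose lower bound is <= address.
merged = pd.merge_asof(
    left, right, left_on="ip_address", right_on="lowerbound_ip_address"
)

# The backward match ignores the upper bound, so drop matches that overshoot it.
inside = merged["ip_address"] <= merged["upperbound_ip_address"]
merged["country"] = merged["country"].where(inside)

# Restore the original row order of A.
result = (
    merged.set_index("index")
    .sort_index()
    .rename_axis(None)[["ip_address", "country"]]
)
print(result)
```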
IntervalIndex is available as of pandas 0.20.0, and the solution by @JohnGalt that uses it is excellent.
Prior to that version, the following solution works: it expands each country's range into one row per IP address, then joins A against the expanded table.
df_ip = pd.concat([pd.DataFrame(
{'ip_address': range(row['lowerbound_ip_address'], row['upperbound_ip_address'] + 1),
'country': row['country']})
for _, row in dfb.iterrows()]).set_index('ip_address')
>>> dfa.set_index('ip_address').join(df_ip)
country
ip_address
13 China
5 Australia
20 China
11 China