How to join two dataframes for which column values are within a certain range?
Given two dataframes df_1 and df_2, how can I join them such that the datetime column of df_1 falls between start and end in df_2:
print(df_1)
timestamp A B
0 2016-05-14 10:54:33 0.020228 0.026572
1 2016-05-14 10:54:34 0.057780 0.175499
2 2016-05-14 10:54:35 0.098808 0.620986
3 2016-05-14 10:54:36 0.158789 1.014819
4 2016-05-14 10:54:39 0.038129 2.384590
print(df_2)
start end event
0 2016-05-14 10:54:31 2016-05-14 10:54:33 E1
1 2016-05-14 10:54:34 2016-05-14 10:54:37 E2
2 2016-05-14 10:54:38 2016-05-14 10:54:42 E3
Get the corresponding event where df_1.timestamp is between df_2.start and df_2.end:
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up the event for each timestamp (this assumes all the datetimes are of Timestamp dtype):
df_2.index = pd.IntervalIndex.from_arrays(df_2['start'],df_2['end'],closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x : df_2.iloc[df_2.index.get_loc(x)]['event'])
Output:
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
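The closed='both' setting matters at the interval edges. The following is a minimal sketch, not part of the original answer, that uses only the first two intervals from the question; the variable names (starts, ends, boundary) are illustrative. It uses get_indexer, which returns -1 for a value that falls in no interval.

```python
import pandas as pd

starts = pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34"])
ends = pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37"])

# A timestamp that sits exactly on the start of the second interval
boundary = pd.DatetimeIndex(["2016-05-14 10:54:34"])

both = pd.IntervalIndex.from_arrays(starts, ends, closed="both")
right = pd.IntervalIndex.from_arrays(starts, ends, closed="right")

pos_both = both.get_indexer(boundary)    # start point is included
pos_right = right.get_indexer(boundary)  # start point is excluded, so no match
print(pos_both, pos_right)
```

With closed='right' the start of each interval is excluded, so a timestamp equal to a start value would not be matched.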
First use IntervalIndex to create a reference index based on the intervals of interest, then use get_indexer to slice the dataframe which contains the discrete events of interest.
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
# .iloc is purely positional, so select the 'event' column by label first
event = df_2['event'].iloc[idx.get_indexer(df_1['timestamp'])]
event
0 E1
1 E2
1 E2
1 E2
2 E3
Name: event, dtype: object
df_1['event'] = event.to_numpy()
df_1
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
Reference: a question on IntervalIndex.get_indexer.
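One caveat with get_indexer is that it returns -1 for a timestamp that falls in no interval, and position -1 would silently pick the last row. A minimal sketch of masking those positions, assuming such unmatched timestamps can occur (the sample data below is made up for illustration):

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2016-05-14 10:54:33", "2016-05-14 10:54:35", "2016-05-14 10:54:50"])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37"]),
    "event": ["E1", "E2"],
})

idx = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")
pos = idx.get_indexer(pd.Index(df_1["timestamp"]))  # -1 where nothing matches

# Positional lookup; entries taken at position -1 are replaced by None below
events = df_2["event"].to_numpy()[pos].astype(object)
df_1["event"] = np.where(pos >= 0, events, None)
print(df_1)
```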
Option 1
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_2.index=idx
df_1['event']=df_2.loc[df_1.timestamp,'event'].values
Option 2
df_2['timestamp']=df_2['end']
pd.merge_asof(df_1,df_2[['timestamp','event']],on='timestamp',direction ='forward',allow_exact_matches =True)
Out[405]:
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
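Note that Option 2 only looks at end: merge_asof with direction='forward' picks the nearest end at or after each timestamp, so a timestamp lying in a gap between intervals would still receive an event. A minimal sketch, assuming gaps can occur (the data is made up for illustration), that adds a check against start afterwards:

```python
import pandas as pd

df_1 = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2016-05-14 10:54:33", "2016-05-14 10:54:35", "2016-05-14 10:54:40"])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(
        ["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:45"]),
    "end": pd.to_datetime(
        ["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:50"]),
    "event": ["E1", "E2", "E3"],
})

# Both frames must be sorted on their keys for merge_asof
merged = pd.merge_asof(df_1, df_2, left_on="timestamp", right_on="end",
                       direction="forward")

# Keep the event only if the timestamp is also at or after the matched start
inside = merged["timestamp"] >= merged["start"]
merged["event"] = merged["event"].where(inside)
result = merged[["timestamp", "event"]]
print(result)
```

Here 10:54:40 is matched to the interval ending at 10:54:50, but it lies before that interval's start, so its event is blanked out.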
In this method, we assume Timestamp objects are used.
df2
start end event
0 2016-05-14 10:54:31 2016-05-14 10:54:33 E1
1 2016-05-14 10:54:34 2016-05-14 10:54:37 E2
2 2016-05-14 10:54:38 2016-05-14 10:54:42 E3
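import numpy as np
event_num = len(df2.event)

def get_event(t):
    # Boolean mask over the intervals, dotted with 0..n-1, gives the matching row position
    event_idx = ((t >= df2.start) & (t <= df2.end)).dot(np.arange(event_num))
    return df2.event[event_idx]

df1["event"] = df1.timestamp.transform(get_event)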
Explanation of get_event
For each timestamp in df1, say t0 = 2016-05-14 10:54:33, (t0 >= df2.start) & (t0 <= df2.end) will contain exactly one True value (see Example 1). Taking the dot product with np.arange(event_num) then gives the index of the event that t0 belongs to.
Examples:
Example 1
   t0 >= df2.start   t0 <= df2.end      After &   np.arange(3)
0  True              True           ->  T         0               event_idx
1  False             True           ->  F         1           ->  0
2  False             True           ->  F         2
Take t2 = 2016-05-14 10:54:35 as another example:
   t2 >= df2.start   t2 <= df2.end      After &   np.arange(3)
0  True              False          ->  F         0               event_idx
1  True              True           ->  T         1           ->  1
2  False             True           ->  F         2
Finally, we use transform to map each timestamp to an event.
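The dot-product step can be checked in isolation. The sketch below is only an illustration built on the question's three intervals, with standalone Series (starts, ends, events) instead of df2. It also shows a limitation worth keeping in mind: when no interval matches, the mask is all False and the dot product is 0, which is indistinguishable from a match on the first event, so an explicit mask.any() check may be needed.

```python
import numpy as np
import pandas as pd

starts = pd.Series(pd.to_datetime(
    ["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]))
ends = pd.Series(pd.to_datetime(
    ["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]))
events = pd.Series(["E1", "E2", "E3"])

# Timestamp inside the second interval: mask is [False, True, False]
t2 = pd.Timestamp("2016-05-14 10:54:35")
mask = (t2 >= starts) & (t2 <= ends)
event_idx = mask.dot(np.arange(len(mask)))
print(event_idx, events.iloc[event_idx])

# Timestamp outside every interval: mask is all False, dot product is 0
t_out = pd.Timestamp("2016-05-14 10:55:00")
mask_out = (t_out >= starts) & (t_out <= ends)
idx_out = mask_out.dot(np.arange(len(mask_out)))
print(idx_out, mask_out.any())
```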
You can make pandas index alignment work for you by the expedient of setting df_1's index to the timestamp field:
import pandas as pd
df_1 = pd.DataFrame(
columns=["timestamp", "A", "B"],
data=[
(pd.Timestamp("2016-05-14 10:54:33"), 0.020228, 0.026572),
(pd.Timestamp("2016-05-14 10:54:34"), 0.057780, 0.175499),
(pd.Timestamp("2016-05-14 10:54:35"), 0.098808, 0.620986),
(pd.Timestamp("2016-05-14 10:54:36"), 0.158789, 1.014819),
(pd.Timestamp("2016-05-14 10:54:39"), 0.038129, 2.384590),
],
)
df_2 = pd.DataFrame(
columns=["start", "end", "event"],
data=[
(
pd.Timestamp("2016-05-14 10:54:31"),
pd.Timestamp("2016-05-14 10:54:33"),
"E1",
),
(
pd.Timestamp("2016-05-14 10:54:34"),
pd.Timestamp("2016-05-14 10:54:37"),
"E2",
),
(
pd.Timestamp("2016-05-14 10:54:38"),
pd.Timestamp("2016-05-14 10:54:42"),
"E3",
),
],
)
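df_2.index = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")
# Index df_1 by timestamp, as described above, so that the assignment aligns on it
df_1 = df_1.set_index("timestamp")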
Just set df_1["event"] to df_2["event"]:
df_1["event"] = df_2["event"]
and voila:
df_1["event"]
timestamp
2016-05-14 10:54:33 E1
2016-05-14 10:54:34 E2
2016-05-14 10:54:35 E2
2016-05-14 10:54:36 E2
2016-05-14 10:54:39 E3
Name: event, dtype: object
The solution by firelynx here on StackOverflow suggests that polymorphism does not work. I have to agree with firelynx (after extensive testing). However, combining that idea of polymorphism with the numpy broadcasting solution of piRSquared, it can work!
The only problem is that, under the hood, the numpy broadcasting effectively performs a cross join in which we keep all pairs that compare equal, giving an O(n1*n2) memory and O(n1*n2) performance hit. There is probably someone who can make this more efficient in a generic sense.
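One possible way to soften the memory cost (though not the running time, which stays O(n1*n2)) is to broadcast in chunks, so that only chunk_size * n2 comparisons are held in memory at once. This is a minimal sketch with plain numeric arrays rather than the classes used in this answer; interval_pairs and chunk_size are hypothetical names introduced here for illustration.

```python
import numpy as np

def interval_pairs(points, starts, ends, chunk_size=1024):
    """Return index pairs (i, j) such that starts[j] < points[i] < ends[j]."""
    rows = [np.empty(0, dtype=np.intp)]
    cols = [np.empty(0, dtype=np.intp)]
    for lo in range(0, len(points), chunk_size):
        # Column vector of at most chunk_size points, broadcast against all intervals
        chunk = points[lo:lo + chunk_size, None]
        i, j = np.nonzero((starts < chunk) & (chunk < ends))
        rows.append(i + lo)  # shift chunk-local row numbers back to global ones
        cols.append(j)
    return np.concatenate(rows), np.concatenate(cols)

points = np.array([1, 5, 9])
starts = np.array([4, 4])
ends = np.array([6, 10])
i, j = interval_pairs(points, starts, ends, chunk_size=2)
print(i, j)
```

The returned i and j can be used in the same way as the output of np.where on the full broadcast comparison.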
The reason I post here is that the question behind firelynx's solution was closed as a duplicate of this question, which I tend to disagree with, because this question and its answers do not give a solution when you have multiple points belonging to multiple intervals, but only for one point belonging to multiple intervals. The solution I propose below does take care of these n:m relations.
Basically, create the following two classes, PointInTime and Timespan, for the polymorphism.
from datetime import datetime
class PointInTime(object):
    doPrint = True

    def __init__(self, year, month, day):
        self.dt = datetime(year, month, day)

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            r = (self.dt == other.dt)
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (equals) gives {r}')
            return (r)
        elif isinstance(other, Timespan):
            r = (other.start_date < self.dt < other.end_date)
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (Timespan in PointInTime) gives {r}')
            return (r)
        else:
            if self.doPrint:
                print(f'Not implemented... (PointInTime)')
            return NotImplemented

    def __repr__(self):
        return "{}-{}-{}".format(self.dt.year, self.dt.month, self.dt.day)

class Timespan(object):
    doPrint = True

    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            r = ((self.start_date == other.start_date) and (self.end_date == other.end_date))
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (equals) gives {r}')
            return (r)
        elif isinstance (other, PointInTime):
            r = self.start_date < other.dt < self.end_date
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (PointInTime in Timespan) gives {r}')
            return (r)
        else:
            if self.doPrint:
                print(f'Not implemented... (Timespan)')
            return NotImplemented

    def __repr__(self):
        return "{}-{}-{} -> {}-{}-{}".format(self.start_date.year, self.start_date.month, self.start_date.day, self.end_date.year, self.end_date.month, self.end_date.day)
BTW, if you wish to use other operators instead of == (such as !=, <, >, <=, >=), you can define the respective methods for them (__ne__, __lt__, __gt__, __le__, __ge__).
The way you can use this in combination with the broadcasting is as follows.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({"pit":[(x) for x in [PointInTime(2015,1,1), PointInTime(2015,2,2), PointInTime(2015,3,3), PointInTime(2015,4,4)]], 'vals1':[1,2,3,4]})
df2 = pd.DataFrame({"ts":[(x) for x in [Timespan(datetime(2015,2,1), datetime(2015,2,5)), Timespan(datetime(2015,2,1), datetime(2015,4,1)), Timespan(datetime(2015,2,1), datetime(2015,2,5))]], 'vals2' : ['a', 'b', 'c']})
a = df1['pit'].values
b = df2['ts'].values
i, j = np.where((a[:,None] == b))
res = pd.DataFrame(
np.column_stack([df1.values[i], df2.values[j]]),
columns=df1.columns.append(df2.columns)
)
print(df1)
print(df2)
print(res)
This gives the output as expected.
<class '__main__.PointInTime'>: comparing 2015-1-1 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-1-1 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-1-1 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-2-2 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-2-2 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-2-2 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-3-3 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-3-3 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-3-3 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-4-4 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-4-4 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-4-4 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
pit vals1
0 2015-1-1 1
1 2015-2-2 2
2 2015-3-3 3
3 2015-4-4 4
ts vals2
0 2015-2-1 -> 2015-2-5 a
1 2015-2-1 -> 2015-4-1 b
2 2015-2-1 -> 2015-2-5 c
pit vals1 ts vals2
0 2015-2-2 2 2015-2-1 -> 2015-2-5 a
1 2015-2-2 2 2015-2-1 -> 2015-4-1 b
2 2015-2-2 2 2015-2-1 -> 2015-2-5 c
3 2015-3-3 3 2015-2-1 -> 2015-4-1 b
The overhead of having the classes probably causes an additional performance loss compared to basic Python types, but I have not looked into that.
The above is how we create the "inner" join. It should be straightforward to create the "(outer) left", "(outer) right" and "(full) outer" joins.
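As a rough illustration of the left join, the unmatched rows of the left frame can be appended to the inner result. This is a sketch under simplifying assumptions, not the code of this answer: it uses plain integers instead of the PointInTime and Timespan classes, and the frame and column names (points, spans, t, lo, hi) are made up.

```python
import numpy as np
import pandas as pd

points = pd.DataFrame({"t": [1, 5, 9], "vals1": ["p", "q", "r"]})
spans = pd.DataFrame({"lo": [4, 4], "hi": [6, 10], "vals2": ["a", "b"]})

# Broadcast every point against every span (open bounds, as in the classes above)
t = points["t"].to_numpy()[:, None]
inside = (spans["lo"].to_numpy() < t) & (t < spans["hi"].to_numpy())
i, j = np.where(inside)

# Inner join: one output row per matching (point, span) pair
inner = pd.concat(
    [points.iloc[i].reset_index(drop=True), spans.iloc[j].reset_index(drop=True)],
    axis=1,
)

# Left join: add the points without any match; their span columns become NaN
unmatched = points.loc[~inside.any(axis=1)]
left = pd.concat([inner, unmatched], ignore_index=True)
print(left)
```

A right join follows the same pattern with the roles of the two frames swapped, and a full outer join appends the unmatched rows of both.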
If the timespans in df_2 are not overlapping, you can use numpy broadcasting to compare each timestamp with all of the timespans and determine which timespan it falls in. Then use argmax to figure out which 'Event' to assign (since, with non-overlapping timespans, there can be at most one match).
The where condition is used to set to NaN any timestamps that fall outside of all timespans (since argmax won't deal with this properly).
import numpy as np
m = ((df_1['timestamp'].to_numpy() >= df_2['start'].to_numpy()[:, None])
& (df_1['timestamp'].to_numpy() <= df_2['end'].to_numpy()[:, None]))
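# take/where keep df_2's (possibly repeated) index labels, so assign by position instead of by index alignment
df_1['Event'] = df_2['event'].take(np.argmax(m, axis=0)).where(m.sum(axis=0) > 0).to_numpy()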
print(df_1)
timestamp A B Event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
One option is conditional_join from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df_1
.conditional_join(
df_2,
# variable arguments
# tuple is of the form:
# col_from_left_df, col_from_right_df, comparator
('timestamp', 'start', '>='),
('timestamp', 'end', '<='),
how = 'inner',
sort_by_appearance = False)
.drop(columns=['start', 'end'])
)
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
You can choose the join type (left, right, or inner) with the how parameter.
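If pyjanitor is not available, a similar inner join can be sketched in plain pandas with a cross merge followed by a filter. This is a different technique from conditional_join: it materialises the full cross product, so it is only reasonable for small frames, and how="cross" assumes pandas 1.2 or later. The data below is a shortened version of the question's example.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2016-05-14 10:54:33", "2016-05-14 10:54:35", "2016-05-14 10:54:39"]),
    "A": [0.020228, 0.098808, 0.038129],
})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(
        ["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(
        ["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

# Every row of df_1 paired with every row of df_2
cross = df_1.merge(df_2, how="cross")

# Keep only the pairs where start <= timestamp <= end (bounds inclusive)
inside = cross["timestamp"].between(cross["start"], cross["end"])
result = cross.loc[inside].drop(columns=["start", "end"]).reset_index(drop=True)
print(result)
```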