
How to join two dataframes for which column values are within a certain range?

Given two dataframes df_1 and df_2, how can they be joined so that the datetime column timestamp of df_1 falls between start and end in df_2:

print(df_1)

  timestamp              A          B
0 2016-05-14 10:54:33    0.020228   0.026572
1 2016-05-14 10:54:34    0.057780   0.175499
2 2016-05-14 10:54:35    0.098808   0.620986
3 2016-05-14 10:54:36    0.158789   1.014819
4 2016-05-14 10:54:39    0.038129   2.384590


print(df_2)

  start                end                  event    
0 2016-05-14 10:54:31  2016-05-14 10:54:33  E1
1 2016-05-14 10:54:34  2016-05-14 10:54:37  E2
2 2016-05-14 10:54:38  2016-05-14 10:54:42  E3

Get the corresponding event where df_1.timestamp is between df_2.start and df_2.end:

  timestamp              A          B          event
0 2016-05-14 10:54:33    0.020228   0.026572   E1
1 2016-05-14 10:54:34    0.057780   0.175499   E2
2 2016-05-14 10:54:35    0.098808   0.620986   E2
3 2016-05-14 10:54:36    0.158789   1.014819   E2
4 2016-05-14 10:54:39    0.038129   2.384590   E3

One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up the event for each timestamp (this assumes all the datetimes are of Timestamp dtype):

df_2.index = pd.IntervalIndex.from_arrays(df_2['start'],df_2['end'],closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x : df_2.iloc[df_2.index.get_loc(x)]['event'])

Output:

             timestamp         A         B event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3

First use IntervalIndex to create a reference index based on the intervals of interest, then use get_indexer to slice the dataframe that contains the discrete events of interest.

idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
# get_indexer returns integer positions, so select the column first and then use iloc
event = df_2['event'].iloc[idx.get_indexer(df_1['timestamp'])]

event
0    E1
1    E2
1    E2
1    E2
2    E3
Name: event, dtype: object

df_1['event'] = event.to_numpy()
df_1
            timestamp         A         B event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3

Reference: a question on IntervalIndex.get_indexer.
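One thing worth checking with get_indexer is what happens when a timestamp falls outside every interval: it returns -1 for that row, and using -1 as a position would silently pick the last event. The following is a minimal sketch, not part of the original answer, that masks those rows instead. The 10:54:50 row and the names pos and events are made up for illustration.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-05-14 10:54:33",
        "2016-05-14 10:54:35",
        "2016-05-14 10:54:50",  # hypothetical row that lies outside every interval
    ]),
})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

idx = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")

# Position of the matching interval for each timestamp, or -1 when there is none
pos = idx.get_indexer(df_1["timestamp"])

# Look up events by position, then blank out the rows that had no match
events = df_2["event"].to_numpy(dtype=object)
df_1["event"] = np.where(pos >= 0, events[pos], None)

print(df_1)
```

The last row ends up with a missing event rather than E3.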

You can use the pandasql module:

import pandasql as ps

sqlcode = '''
select df_1.timestamp
,df_1.A
,df_1.B
,df_2.event
from df_1 
inner join df_2 
on df_1.timestamp between df_2.start and df_2.end
'''

newdf = ps.sqldf(sqlcode,locals())
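For comparison, the same range join can be sketched in plain pandas as a cross join followed by a filter. This is an illustrative equivalent, not what pandasql does internally. It assumes pandas 1.2 or later (for how="cross") and builds every row pair, so it only suits small inputs.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-05-14 10:54:33", "2016-05-14 10:54:34", "2016-05-14 10:54:35",
        "2016-05-14 10:54:36", "2016-05-14 10:54:39",
    ]),
    "A": [0.020228, 0.057780, 0.098808, 0.158789, 0.038129],
    "B": [0.026572, 0.175499, 0.620986, 1.014819, 2.384590],
})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

# Pair every row of df_1 with every row of df_2 (len(df_1) * len(df_2) rows)
joined = df_1.merge(df_2, how="cross")

# Keep only the pairs where start <= timestamp <= end (between is inclusive by default)
in_range = joined["timestamp"].between(joined["start"], joined["end"])
newdf = joined.loc[in_range, ["timestamp", "A", "B", "event"]].reset_index(drop=True)

print(newdf)
```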

Option 1

idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_2.index=idx
df_1['event']=df_2.loc[df_1.timestamp,'event'].values

Option 2

df_2['timestamp']=df_2['end']
pd.merge_asof(df_1,df_2[['timestamp','event']],on='timestamp',direction ='forward',allow_exact_matches =True)
Out[405]: 
            timestamp         A         B event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3
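Note that the merge_asof call above matches each timestamp to the next end value and never looks at start, so a timestamp that falls in a gap between intervals would still be given an event. A possible guard, sketched here with a made-up 10:54:30 row and a helper frame named right, is to carry start through the merge and blank out rows that precede it.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-05-14 10:54:30",  # hypothetical row that precedes every interval
        "2016-05-14 10:54:33",
        "2016-05-14 10:54:35",
        "2016-05-14 10:54:39",
    ]),
})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

# Use end as the merge key, but keep start so it can be checked afterwards
right = df_2.assign(timestamp=df_2["end"])[["timestamp", "start", "event"]]

# Both frames must be sorted by the key; forward picks the first end >= timestamp
merged = pd.merge_asof(df_1, right, on="timestamp", direction="forward")

# Drop the event where the timestamp lies before the start of the matched interval
merged["event"] = merged["event"].where(merged["timestamp"] >= merged["start"])
merged = merged.drop(columns="start")

print(merged)
```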

This method assumes that Timestamp objects are used.

df2  start                end                  event    
   0 2016-05-14 10:54:31  2016-05-14 10:54:33  E1
   1 2016-05-14 10:54:34  2016-05-14 10:54:37  E2
   2 2016-05-14 10:54:38  2016-05-14 10:54:42  E3

import numpy as np

event_num = len(df2.event)

def get_event(t):    
    event_idx = ((t >= df2.start) & (t <= df2.end)).dot(np.arange(event_num))
    return df2.event[event_idx]

df1["event"] = df1.timestamp.transform(get_event)

Explanation of get_event

For each timestamp in df1, say t0 = 2016-05-14 10:54:33,

(t0 >= df2.start) & (t0 <= df2.end) will contain exactly one True (see Example 1). Taking the dot product with np.arange(event_num) then gives the index of the event that t0 belongs to.

Examples:

Example 1

    t0 >= df2.start    t0 <= df2.end     After &     np.arange(3)    
0     True                True         ->  T              0        event_idx
1    False                True         ->  F              1     ->     0
2    False                True         ->  F              2

As another example, take t2 = 2016-05-14 10:54:35:

    t2 >= df2.start    t2 <= df2.end     After &     np.arange(3)    
0     True                False        ->  F              0        event_idx
1     True                True         ->  T              1     ->     1
2    False                True         ->  F              2

Finally, we use transform to turn each timestamp into an event.
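The dot-product step can be checked on a single timestamp. The sketch below rebuilds df2 from the table above and uses illustrative names (t, mask). It also shows a caveat that follows from the arithmetic: if no interval matches, the dot product is 0, which is indistinguishable from a match on the first event, so the mask should be checked first.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

t = pd.Timestamp("2016-05-14 10:54:35")

# Boolean mask with one entry per interval: [False, True, False]
mask = ((t >= df2["start"]) & (t <= df2["end"])).to_numpy()

# With exactly one True, the dot product with [0, 1, 2] is that entry's position
event_idx = mask.dot(np.arange(len(df2)))
event = df2["event"].iloc[event_idx]
print(event_idx, event)

# A timestamp outside every interval gives an all-False mask and a dot product of 0,
# which would wrongly point at the first event unless the mask is checked
t_outside = pd.Timestamp("2016-05-14 10:54:50")
mask_outside = ((t_outside >= df2["start"]) & (t_outside <= df2["end"])).to_numpy()
print(mask_outside.any(), mask_outside.dot(np.arange(len(df2))))
```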

You can make pandas index alignment work for you by setting df_1's index to the timestamp field:

import pandas as pd

df_1 = pd.DataFrame(
    columns=["timestamp", "A", "B"],
    data=[
        (pd.Timestamp("2016-05-14 10:54:33"), 0.020228, 0.026572),
        (pd.Timestamp("2016-05-14 10:54:34"), 0.057780, 0.175499),
        (pd.Timestamp("2016-05-14 10:54:35"), 0.098808, 0.620986),
        (pd.Timestamp("2016-05-14 10:54:36"), 0.158789, 1.014819),
        (pd.Timestamp("2016-05-14 10:54:39"), 0.038129, 2.384590),
    ],
)
df_2 = pd.DataFrame(
    columns=["start", "end", "event"],
    data=[
        (
            pd.Timestamp("2016-05-14 10:54:31"),
            pd.Timestamp("2016-05-14 10:54:33"),
            "E1",
        ),
        (
            pd.Timestamp("2016-05-14 10:54:34"),
            pd.Timestamp("2016-05-14 10:54:37"),
            "E2",
        ),
        (
            pd.Timestamp("2016-05-14 10:54:38"),
            pd.Timestamp("2016-05-14 10:54:42"),
            "E3",
        ),
    ],
)
df_2.index = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")
# index df_1 by timestamp so that the assignment below aligns on it
df_1 = df_1.set_index("timestamp")

Then just set df_1["event"] to df_2["event"]:

df_1["event"] = df_2["event"]

and voilà:

df_1["event"]

timestamp
2016-05-14 10:54:33    E1
2016-05-14 10:54:34    E2
2016-05-14 10:54:35    E2
2016-05-14 10:54:36    E2
2016-05-14 10:54:39    E3
Name: event, dtype: object

The solution by firelynx here on StackOverflow suggests that polymorphism does not work. I have to agree with firelynx (after extensive testing). However, combining that idea of polymorphism with the numpy broadcasting solution of piRSquared, it can work!

The only problem is that, under the hood, the numpy broadcasting does perform a kind of cross join in which we keep all element pairs that compare equal, which costs O(n1*n2) memory and O(n1*n2) time. Probably someone can make this more efficient in a generic sense.

The reason I post here is that the question behind firelynx's solution was closed as a duplicate of this question, which I tend to disagree with. This question and its answers do not give a solution when you have multiple points belonging to multiple intervals, but only for one point belonging to multiple intervals. The solution I propose below does take care of these n:m relations.

Basically, create the following two classes, PointInTime and Timespan, for the polymorphism.

from datetime import datetime

class PointInTime(object):
    doPrint = True
    def __init__(self, year, month, day):
        self.dt = datetime(year, month, day)

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            r = (self.dt == other.dt)
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (equals) gives {r}')
            return (r)
        elif isinstance(other, Timespan):
            r = (other.start_date < self.dt < other.end_date)
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (Timespan in PointInTime) gives {r}')
            return (r)
        else:
            if self.doPrint:
                print(f'Not implemented... (PointInTime)')
            return NotImplemented

    def __repr__(self):
        return "{}-{}-{}".format(self.dt.year, self.dt.month, self.dt.day)

class Timespan(object):
    doPrint = True
    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date   = end_date

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            r = ((self.start_date == other.start_date) and (self.end_date == other.end_date))
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (equals) gives {r}')
            return (r)
        elif isinstance (other, PointInTime):
            r = self.start_date < other.dt < self.end_date
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (PointInTime in Timespan) gives {r}')
            return (r)
        else:
            if self.doPrint:
                print(f'Not implemented... (Timespan)')
            return NotImplemented

    def __repr__(self):
        return "{}-{}-{} -> {}-{}-{}".format(self.start_date.year, self.start_date.month, self.start_date.day, self.end_date.year, self.end_date.month, self.end_date.day)

BTW, if you wish to use operators other than == (such as !=, <, >, <=, >=), you can define the respective methods for them (__ne__, __lt__, __gt__, __le__, __ge__).

This can be combined with broadcasting as follows.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"pit":[(x) for x in [PointInTime(2015,1,1), PointInTime(2015,2,2), PointInTime(2015,3,3), PointInTime(2015,4,4)]], 'vals1':[1,2,3,4]})
df2 = pd.DataFrame({"ts":[(x) for x in [Timespan(datetime(2015,2,1), datetime(2015,2,5)), Timespan(datetime(2015,2,1), datetime(2015,4,1)), Timespan(datetime(2015,2,1), datetime(2015,2,5))]], 'vals2' : ['a', 'b', 'c']})
a = df1['pit'].values
b = df2['ts'].values
i, j = np.where((a[:,None] == b))

res = pd.DataFrame(
    np.column_stack([df1.values[i], df2.values[j]]),
    columns=df1.columns.append(df2.columns)
)
print(df1)
print(df2)
print(res)

This gives the expected output.

<class '__main__.PointInTime'>: comparing 2015-1-1 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-1-1 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-1-1 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-2-2 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-2-2 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-2-2 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-3-3 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-3-3 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-3-3 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-4-4 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-4-4 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-4-4 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
        pit  vals1
0  2015-1-1      1
1  2015-2-2      2
2  2015-3-3      3
3  2015-4-4      4
                     ts vals2
0  2015-2-1 -> 2015-2-5     a
1  2015-2-1 -> 2015-4-1     b
2  2015-2-1 -> 2015-2-5     c
        pit vals1                    ts vals2
0  2015-2-2     2  2015-2-1 -> 2015-2-5     a
1  2015-2-2     2  2015-2-1 -> 2015-4-1     b
2  2015-2-2     2  2015-2-1 -> 2015-2-5     c
3  2015-3-3     3  2015-2-1 -> 2015-4-1     b

Using classes probably costs some additional performance compared with basic Python types, but I have not looked into that.

The above is how we create the "inner" join. It should be straightforward to create the "(outer) left", "(outer) right" and "(full) outer" joins.
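As one possible way to build the left join, the sketch below appends the left rows that matched nothing to the inner result. It is an illustration rather than the author's code: it uses plain datetime columns instead of the PointInTime and Timespan classes, and the names points, spans and match are invented for the example.

```python
import numpy as np
import pandas as pd

points = pd.DataFrame({
    "pit": pd.to_datetime(["2015-01-01", "2015-02-02", "2015-03-03", "2015-04-04"]),
    "vals1": [1, 2, 3, 4],
})
spans = pd.DataFrame({
    "start": pd.to_datetime(["2015-02-01", "2015-02-01", "2015-02-01"]),
    "end": pd.to_datetime(["2015-02-05", "2015-04-01", "2015-02-05"]),
    "vals2": ["a", "b", "c"],
})

# Broadcast every point against every span: match[i, j] is True when
# point i lies strictly inside span j (the same test as in the classes above)
p = points["pit"].to_numpy()[:, None]
match = (spans["start"].to_numpy() < p) & (p < spans["end"].to_numpy())
i, j = np.nonzero(match)

# Inner join: one output row per matching (point, span) pair
inner = pd.concat(
    [points.iloc[i].reset_index(drop=True), spans.iloc[j].reset_index(drop=True)],
    axis=1,
)

# Left join: add the points that matched no span; their span columns become missing
unmatched = points.loc[~match.any(axis=1)]
left_join = pd.concat([inner, unmatched], ignore_index=True)

print(left_join)
```

The unmatched rows come last here. Sort by pit afterwards if the original order matters.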

If the timespans in df_2 do not overlap, you can use numpy broadcasting to compare each timestamp with all of the timespans and determine which timespan it falls in. Then use argmax to figure out which 'Event' to assign (with non-overlapping timespans there can be at most one match).

The where condition sets to NaN any timestamp that falls outside all timespans (since argmax won't handle this case properly).

import numpy as np

m = ((df_1['timestamp'].to_numpy() >= df_2['start'].to_numpy()[:, None])
      & (df_1['timestamp'].to_numpy() <= df_2['end'].to_numpy()[:, None]))

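# to_numpy() assigns by position; the taken Series carries df_2's (possibly repeated) index labels
df_1['Event'] = df_2['event'].take(np.argmax(m, axis=0)).where(m.sum(axis=0) > 0).to_numpy()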

print(df_1)
            timestamp         A         B Event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3

One option is to use conditional_join from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

(df_1                         
.conditional_join(
          df_2, 
          # variable arguments
          # tuple is of the form:
          # col_from_left_df, col_from_right_df, comparator
          ('timestamp', 'start', '>='), 
          ('timestamp', 'end', '<='),
          how = 'inner',
          sort_by_appearance = False)
.drop(columns=['start', 'end'])
)

            timestamp         A         B event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3

You can choose the join type (left, right, or inner) with the how parameter.


