Is there a way to merge on Interval Index and another Column Value in pandas?

I currently have two dataframes with different columns, and I am trying to figure out how to merge them on an interval index as well as a unique ID value. Below are examples of the two dataframes:

UniqueID,Start_Date,End_Date
ID1,01-01-2020,01-08-2020
ID2,01-02-2020,01-04-2020
ID3,01-03-2020,01-05-2020
ID4,01-04-2020,01-09-2020
ID5,01-05-2020,01-10-2020
ID6,01-06-2020,01-11-2020

Creating the dataframe:

df1 = pd.DataFrame({
    'UniqueID': ['ID1','ID2','ID3','ID4','ID5','ID6'],
    'Start_Date': ['01-01-2020','01-02-2020','01-03-2020','01-04-2020','01-05-2020','01-06-2020'],
    'End_Date': ['01-08-2020','01-04-2020','01-05-2020','01-09-2020','01-10-2020','01-11-2020']
})

UniqueID,Trip_Date,Value
ID1,10-02-2020,1
ID1,15-02-2020,207
ID2,06-03-2020,10
ID3,29-01-2022,15
ID9,15-02-2020,207
ID12,19-06-2021,189

Creating the dataframe:

df2 = pd.DataFrame({
    'UniqueID': ['ID1','ID1','ID2','ID3','ID9','ID12'],
    'Trip_Date': ['10-02-2020','15-02-2020','06-03-2020','29-01-2022','15-02-2020','19-06-2021'],
    'Value': [1, 207, 10, 15, 207, 189]
})

What I want to do is merge on the UniqueID as well as on the interval between the start date and end date, inclusive. The resulting dataframe would look like the one below:

UniqueID,Start_Date,End_Date,Trip_Date,Value
ID1,01-01-2020,01-08-2020,10-02-2020,1
ID1,01-01-2020,01-08-2020,15-02-2020,207
ID2,01-02-2020,01-04-2020,06-03-2020,10
ID3,01-03-2020,01-05-2020,NA,NA
ID4,01-04-2020,01-09-2020,NA,NA
ID5,01-05-2020,01-10-2020,NA,NA
ID6,01-06-2020,01-11-2020,NA,NA

df2.merge(df1, how='left', on='UniqueID')

The first method I thought of was to build an IntervalIndex on df1 and merge on that, but then I cannot also merge on the UniqueID; conversely, if I use UniqueID as the merge column, I cannot merge on the interval. I kept a left join when merging df2 with df1 in order to preserve the original dataframe while attaching any records of df1 with potential "matches" in df2.

I thought of potentially using a MultiIndex with an IntervalIndex as one of the levels and the UniqueID as another, but I wasn't sure how to go about this. Any ideas would be greatly appreciated!

The code below should allow you to get the dataframes into pandas. Just make sure to copy and reassign.

df = pd.read_clipboard(sep=',')
df1 = df.copy()
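
If the clipboard is not convenient, the same sample rows can be loaded from inline text. This is only a sketch: the variable names df1_csv and df2_csv are made up for illustration, and the dates are assumed to be day-first (DD-MM-YYYY), which is what a value such as 15-02-2020 implies.

```python
import io

import pandas as pd

# Sample rows copied from the question; df1_csv / df2_csv are illustrative names.
df1_csv = """UniqueID,Start_Date,End_Date
ID1,01-01-2020,01-08-2020
ID2,01-02-2020,01-04-2020
ID3,01-03-2020,01-05-2020
ID4,01-04-2020,01-09-2020
ID5,01-05-2020,01-10-2020
ID6,01-06-2020,01-11-2020
"""

df2_csv = """UniqueID,Trip_Date,Value
ID1,10-02-2020,1
ID1,15-02-2020,207
ID2,06-03-2020,10
ID3,29-01-2022,15
ID9,15-02-2020,207
ID12,19-06-2021,189
"""

df1 = pd.read_csv(io.StringIO(df1_csv))
df2 = pd.read_csv(io.StringIO(df2_csv))

# Assumption: dates are day-first, so parse them with an explicit format.
for col in ("Start_Date", "End_Date"):
    df1[col] = pd.to_datetime(df1[col], format="%d-%m-%Y")
df2["Trip_Date"] = pd.to_datetime(df2["Trip_Date"], format="%d-%m-%Y")

print(df1.dtypes)
print(df2.dtypes)
```

Parsing with an explicit format avoids any ambiguity between day-first and month-first readings of values such as 01-08-2020.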

You can merge dataframes on two columns. So if you calculate intervals in each dataframe, you can match on 'UniqueID' and 'Interval'. See for instance this post: pandas: merge (join) two data frames on multiple columns.
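
One possible reading of this suggestion, sketched under assumptions rather than taken from the answer: represent the interval by its two bound columns instead of a single 'Interval' column, attach the bounds to each trip, keep only the trips that fall inside them, and then left-merge on UniqueID plus both bounds. The names trips, in_range, keys and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "UniqueID": ["ID1", "ID2", "ID3", "ID4", "ID5", "ID6"],
    "Start_Date": ["01-01-2020", "01-02-2020", "01-03-2020",
                   "01-04-2020", "01-05-2020", "01-06-2020"],
    "End_Date": ["01-08-2020", "01-04-2020", "01-05-2020",
                 "01-09-2020", "01-10-2020", "01-11-2020"],
})
df2 = pd.DataFrame({
    "UniqueID": ["ID1", "ID1", "ID2", "ID3", "ID9", "ID12"],
    "Trip_Date": ["10-02-2020", "15-02-2020", "06-03-2020",
                  "29-01-2022", "15-02-2020", "19-06-2021"],
    "Value": [1, 207, 10, 15, 207, 189],
})

# Assumption: dates are day-first (DD-MM-YYYY).
for col in ("Start_Date", "End_Date"):
    df1[col] = pd.to_datetime(df1[col], format="%d-%m-%Y")
df2["Trip_Date"] = pd.to_datetime(df2["Trip_Date"], format="%d-%m-%Y")

# Step 1: give every trip the interval bounds recorded for its UniqueID.
trips = df2.merge(df1, on="UniqueID", how="inner")

# Step 2: keep only trips inside [Start_Date, End_Date]; between() is
# inclusive on both ends by default.
in_range = trips["Trip_Date"].between(trips["Start_Date"], trips["End_Date"])
trips = trips[in_range]

# Step 3: left-merge on the ID and both bounds, so every row of df1 survives.
keys = ["UniqueID", "Start_Date", "End_Date"]
result = df1.merge(trips, on=keys, how="left")
print(result)
```

Because out-of-range trips are dropped before the final merge, an ID with one trip inside its interval and another outside it keeps only the matching row, rather than gaining an extra row of missing values.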

Merge your dataframes on the UniqueID column, then check whether Trip_Date is between Start_Date and End_Date. Finally, set to NaN all rows where the condition is not met:

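import numpy as np
import pandas as pd

# Convert only if these columns do not already have a datetime64 dtype
df1['Start_Date'] = pd.to_datetime(df1['Start_Date'], dayfirst=True)
df1['End_Date'] = pd.to_datetime(df1['End_Date'], dayfirst=True)
df2['Trip_Date'] = pd.to_datetime(df2['Trip_Date'], dayfirst=True)

# Left-merge on the ID, then flag trips inside [Start_Date, End_Date] (inclusive)
out = pd.merge(df1, df2, on='UniqueID', how='left')
m = out['Trip_Date'].between(out['Start_Date'], out['End_Date'])

# Blank out the trip data on rows that fall outside the interval
out.loc[~m, ['Trip_Date', 'Value']] = np.nan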

Output:

>>> out
  UniqueID Start_Date   End_Date  Trip_Date  Value
0      ID1 2020-01-01 2020-08-01 2020-02-10    1.0
1      ID1 2020-01-01 2020-08-01 2020-02-15  207.0
2      ID2 2020-02-01 2020-04-01 2020-03-06   10.0
3      ID3 2020-03-01 2020-05-01        NaT    NaN
4      ID4 2020-04-01 2020-09-01        NaT    NaN
5      ID5 2020-05-01 2020-10-01        NaT    NaN
6      ID6 2020-06-01 2020-11-01        NaT    NaN

# Plain left merge on the shared UniqueID column (no date-range check)
import pandas as pd

df1 = pd.read_csv("df1.csv")
df2 = pd.read_csv("df2.csv")
new_df = pd.merge(df1, df2, on='UniqueID', how='left')
print(new_df)


 