Get non-overlapping periods with date ranges from 2 dataframes

Question

I'm working on a billing system.

On the one hand, I have contracts with start and end dates, which I need to bill monthly. One contract can have several start/end date ranges, but they can't overlap within the same contract.

On the other hand, I have a df with the invoices billed per contract, with their start and end dates. Invoice start/end dates for a specific contract can't overlap either. There can be gaps, though, between the end date of one invoice and the start of the next.

My goal is to look at the contract start/end dates and remove all the periods already billed for each contract, so that I know what's left to be billed.

Here is my data for contracts:

import pandas as pd

contract_df = pd.DataFrame({'contract_id': {0: 'C00770052',
  1: 'C00770052',
  2: 'C00770052',
  3: 'C00770052',
  4: 'C00770053'},
 'from': {0: pd.to_datetime('2018-07-01 00:00:00'),
  1: pd.to_datetime('2019-01-01 00:00:00'),
  2: pd.to_datetime('2019-07-01 00:00:00'),
  3: pd.to_datetime('2019-09-01 00:00:00'),
  4: pd.to_datetime('2019-10-01 00:00:00')},
 'to': {0: pd.to_datetime('2019-01-01 00:00:00'),
  1: pd.to_datetime('2019-07-01 00:00:00'),
  2: pd.to_datetime('2019-09-01 00:00:00'),
  3: pd.to_datetime('2021-01-01 00:00:00'),
  4: pd.to_datetime('2024-01-01 00:00:00')}})

Here is my invoice data (no invoice for C00770053):

invoice_df = pd.DataFrame({'contract_id': {0: 'C00770052',
  1: 'C00770052',
  2: 'C00770052',
  3: 'C00770052',
  4: 'C00770052',
  5: 'C00770052',
  6: 'C00770052',
  7: 'C00770052'},
 'from': {0: pd.to_datetime('2018-07-01 00:00:00'),
  1: pd.to_datetime('2018-08-01 00:00:00'),
  2: pd.to_datetime('2018-09-01 00:00:00'),
  3: pd.to_datetime('2018-10-01 00:00:00'),
  4: pd.to_datetime('2018-11-01 00:00:00'),
  5: pd.to_datetime('2019-05-01 00:00:00'),
  6: pd.to_datetime('2019-06-01 00:00:00'),
  7: pd.to_datetime('2019-07-01 00:00:00')},
 'to': {0: pd.to_datetime('2018-08-01 00:00:00'),
  1: pd.to_datetime('2018-09-01 00:00:00'),
  2: pd.to_datetime('2018-10-01 00:00:00'),
  3: pd.to_datetime('2018-11-01 00:00:00'),
  4: pd.to_datetime('2019-04-01 00:00:00'),
  5: pd.to_datetime('2019-06-01 00:00:00'),
  6: pd.to_datetime('2019-07-01 00:00:00'),
  7: pd.to_datetime('2019-09-01 00:00:00')}})

My expected result is:

to_bill_df = pd.DataFrame({'contract_id': {0: 'C00770052',
  1: 'C00770052',
  2: 'C00770053'},
 'from': {0: pd.to_datetime('2019-04-01 00:00:00'),
  1: pd.to_datetime('2019-09-01 00:00:00'),
  2: pd.to_datetime('2019-10-01 00:00:00')},
 'to': {0: pd.to_datetime('2019-05-01 00:00:00'),
  1: pd.to_datetime('2021-01-01 00:00:00'),
  2: pd.to_datetime('2024-01-01 00:00:00')}})

What I need, therefore, is to go through each row of contract_df, identify the invoices matching the relevant period, and remove the periods that have already been billed, possibly splitting the contract_df row into two rows if there is a gap.

The problem is that doing it this way seems very heavy, considering that I'll have millions of invoices and contracts. I feel like there is an easy way with pandas, but I'm not sure how to do it.

Thanks

Answer 1

I was solving a similar problem the other day. It's not a simple solution, but it should be generic enough to identify any non-overlapping intervals.

The idea is to convert your dates into continuous integers so that overlaps can be removed with a set OR (union) operator. The function below transforms your DataFrame into a dictionary that contains, for each ID, the set of non-overlapping integer dates.
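As a toy illustration of that idea, here is a minimal sketch using only the standard library. The helper name day_numbers is hypothetical and not part of the answer, and the dates are a small subset picked from the question's data: each half-open range [start, end) becomes a set of day offsets, sets are merged with |, and billed days are removed with -.

```python
from datetime import date, timedelta

REFERENCE = date(1900, 1, 1)  # arbitrary reference date, same choice as in the answer

def day_numbers(start, end):
    """Return the set of day offsets in the half-open range [start, end)."""
    return set(range((start - REFERENCE).days, (end - REFERENCE).days))

# Two back-to-back invoices merged with a set union
invoiced = (
    day_numbers(date(2019, 5, 1), date(2019, 6, 1))
    | day_numbers(date(2019, 6, 1), date(2019, 7, 1))
)
# One contract period
contract = day_numbers(date(2019, 1, 1), date(2019, 7, 1))

# Days covered by the contract but by neither of the two invoices
unbilled = contract - invoiced
first_unbilled = REFERENCE + timedelta(days=min(unbilled))
last_unbilled = REFERENCE + timedelta(days=max(unbilled))
```

Because every single day becomes a set element, memory use grows with the length of the periods, which is worth keeping in mind for contracts that span many years.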

from functools import reduce

import pandas as pd

def non_overlapping_intervals(df, uid, date_from, date_to):
    # Work on a copy so the helper columns are not added to the caller's DataFrame
    df = df.copy()
    # Convert each date to an integer number of days since a reference date
    helper_from = date_from + '_helper'
    helper_to = date_to + '_helper'
    df[helper_from] = df[date_from].sub(pd.Timestamp('1900-01-01')).dt.days  # set a reference date
    df[helper_to] = df[date_to].sub(pd.Timestamp('1900-01-01')).dt.days

    out = (
        df[[uid, helper_from, helper_to]]
        .dropna()
        .groupby(uid)
        [[helper_from, helper_to]]
        .apply(
            lambda x: reduce(  # Apply for an arbitrary number of cases
                lambda a, b: a | b, x.apply(  # Eliminate the overlapping dates OR operation on set
                    lambda y: set(range(y[helper_from], y[helper_to])), # Create continuous integers for date ranges
                    axis=1
                )
            )
        )
        .to_dict()
    )
    return out

From here, we do a set subtraction to find the dates and IDs for which there are contracts but no invoices:

from collections import defaultdict

invoice_dates = defaultdict(set, non_overlapping_intervals(invoice_df, 'contract_id', 'from', 'to'))
contract_dates = defaultdict(set, non_overlapping_intervals(contract_df, 'contract_id', 'from', 'to'))

missing_dates = {}
for k, v in contract_dates.items():
    missing_dates[k] = list(v - invoice_dates.get(k, set()))

Now we have a dict called missing_dates that gives us, for each ID, every date for which there is no invoice. To convert it into your output format, we need to separate each continuous group of dates for each ID. Using this answer, we arrive at the below:

from itertools import groupby
from operator import itemgetter

missing_invoices = []
for uid, dates in missing_dates.items():
    for k, g in groupby(enumerate(sorted(dates)), lambda x: x[0] - x[1]):
        group = list(map(int, map(itemgetter(1), g)))
        missing_invoices.append([uid, group[0], group[-1]])
missing_invoices = pd.DataFrame(missing_invoices, columns=['contract_id', 'from', 'to'])

# Convert back to datetime
missing_invoices['from'] = missing_invoices['from'].apply(lambda x: pd.Timestamp('1900-01-01') + pd.DateOffset(days=x))
missing_invoices['to'] = missing_invoices['to'].apply(lambda x: pd.Timestamp('1900-01-01') + pd.DateOffset(days=x + 1))
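The grouping step above relies on the fact that, in a sorted run of consecutive integers, the difference between each element's position and its value stays constant. A small standalone sketch of that trick follows; the function name consecutive_runs is hypothetical and used only for illustration.

```python
from itertools import groupby

def consecutive_runs(values):
    """Split integers into (first, last) pairs, one per run of consecutive values."""
    runs = []
    # Within a run of consecutive integers, index - value is constant,
    # so it works as a grouping key
    for _, group in groupby(enumerate(sorted(values)), key=lambda pair: pair[0] - pair[1]):
        members = [value for _, value in group]
        runs.append((members[0], members[-1]))
    return runs

runs = consecutive_runs({1, 2, 3, 7, 8, 10})
# runs == [(1, 3), (7, 8), (10, 10)]
```

The last value of each run is inclusive, which is why the conversion back to datetime adds one extra day to the 'to' column to restore the half-open convention used in the question.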

Probably not the simple solution you were looking for, but this should be reasonably efficient.
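If expanding every day into a set turns out to be too memory-hungry for long contracts, one possible alternative is to subtract the intervals directly. The following is only a sketch under stated assumptions, not the method used above: subtract_intervals is a hypothetical helper, it works on one contract at a time, and it assumes half-open (start, end) tuples as in the question's data.

```python
from datetime import date

def subtract_intervals(periods, billed):
    """Return the parts of `periods` not covered by `billed`.

    Both arguments are lists of half-open (start, end) tuples.
    Leftover pieces that touch each other are merged.
    """
    billed = sorted(billed)
    remaining = []
    for start, end in sorted(periods):
        cursor = start  # everything before cursor is already accounted for
        for b_start, b_end in billed:
            if b_end <= cursor:
                continue  # invoice ends before the part still to check
            if b_start >= end:
                break  # invoice starts after this period ends
            if b_start > cursor:
                remaining.append((cursor, b_start))  # unbilled gap before this invoice
            cursor = max(cursor, b_end)
            if cursor >= end:
                break
        if cursor < end:
            remaining.append((cursor, end))  # unbilled tail of the period
    merged = []
    for start, end in remaining:
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)  # join pieces that touch
        else:
            merged.append((start, end))
    return merged

# Periods for contract C00770052, copied from the question
contract_periods = [
    (date(2018, 7, 1), date(2019, 1, 1)),
    (date(2019, 1, 1), date(2019, 7, 1)),
    (date(2019, 7, 1), date(2019, 9, 1)),
    (date(2019, 9, 1), date(2021, 1, 1)),
]
invoice_periods = [
    (date(2018, 7, 1), date(2018, 8, 1)),
    (date(2018, 8, 1), date(2018, 9, 1)),
    (date(2018, 9, 1), date(2018, 10, 1)),
    (date(2018, 10, 1), date(2018, 11, 1)),
    (date(2018, 11, 1), date(2019, 4, 1)),
    (date(2019, 5, 1), date(2019, 6, 1)),
    (date(2019, 6, 1), date(2019, 7, 1)),
    (date(2019, 7, 1), date(2019, 9, 1)),
]
to_bill = subtract_intervals(contract_periods, invoice_periods)
```

For this contract, to_bill holds the two periods from the expected result (2019-04-01 to 2019-05-01 and 2019-09-01 to 2021-01-01). The inner loop rescans the invoice list for every period, so this is meant to show the logic rather than to be a tuned implementation.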

Answer 1: accepted, 2019-10-15 10:15:50
