
Get the non-overlapping periods from two DataFrames with date ranges

I'm working on a billing system.

On the one hand, I have contracts with start and end dates, which I need to bill monthly. One contract can have several start/end dates, but they can't overlap within the same contract.

On the other hand, I have a df with the invoices billed per contract, with their start and end dates. The invoices' start/end dates for a given contract can't overlap either. There can be gaps, though, between the end date of one invoice and the start of the next.

My goal is to look at the contract start/end dates and remove all the periods already billed for each contract, so that I know what's left to be billed.

Here is my data for contract:

import pandas as pd

contract_df = pd.DataFrame({'contract_id': {0: 'C00770052',
  1: 'C00770052',
  2: 'C00770052',
  3: 'C00770052',
  4: 'C00770053'},
 'from': {0: pd.to_datetime('2018-07-01 00:00:00'),
  1: pd.to_datetime('2019-01-01 00:00:00'),
  2: pd.to_datetime('2019-07-01 00:00:00'),
  3: pd.to_datetime('2019-09-01 00:00:00'),
  4: pd.to_datetime('2019-10-01 00:00:00')},
 'to': {0: pd.to_datetime('2019-01-01 00:00:00'),
  1: pd.to_datetime('2019-07-01 00:00:00'),
  2: pd.to_datetime('2019-09-01 00:00:00'),
  3: pd.to_datetime('2021-01-01 00:00:00'),
  4: pd.to_datetime('2024-01-01 00:00:00')}})


Here is my invoice data (no invoice for C00770053):

invoice_df = pd.DataFrame({'contract_id': {0: 'C00770052',
  1: 'C00770052',
  2: 'C00770052',
  3: 'C00770052',
  4: 'C00770052',
  5: 'C00770052',
  6: 'C00770052',
  7: 'C00770052'},
 'from': {0: pd.to_datetime('2018-07-01 00:00:00'),
  1: pd.to_datetime('2018-08-01 00:00:00'),
  2: pd.to_datetime('2018-09-01 00:00:00'),
  3: pd.to_datetime('2018-10-01 00:00:00'),
  4: pd.to_datetime('2018-11-01 00:00:00'),
  5: pd.to_datetime('2019-05-01 00:00:00'),
  6: pd.to_datetime('2019-06-01 00:00:00'),
  7: pd.to_datetime('2019-07-01 00:00:00')},
 'to': {0: pd.to_datetime('2018-08-01 00:00:00'),
  1: pd.to_datetime('2018-09-01 00:00:00'),
  2: pd.to_datetime('2018-10-01 00:00:00'),
  3: pd.to_datetime('2018-11-01 00:00:00'),
  4: pd.to_datetime('2019-04-01 00:00:00'),
  5: pd.to_datetime('2019-06-01 00:00:00'),
  6: pd.to_datetime('2019-07-01 00:00:00'),
  7: pd.to_datetime('2019-09-01 00:00:00')}})


My expected result is:

to_bill_df = pd.DataFrame({'contract_id': {0: 'C00770052',
  1: 'C00770052',
  2: 'C00770053'},
 'from': {0: pd.to_datetime('2019-04-01 00:00:00'),
  1: pd.to_datetime('2019-09-01 00:00:00'),
  2: pd.to_datetime('2019-10-01 00:00:00')},
 'to': {0: pd.to_datetime('2019-05-01 00:00:00'),
  1: pd.to_datetime('2021-01-01 00:00:00'),
  2: pd.to_datetime('2024-01-01 00:00:00')}})


What I need, therefore, is to go through each row of contract_df, identify the invoices matching the relevant period, and remove the periods that have already been billed, possibly splitting the contract_df row into 2 rows if there is a gap.

The problem is that doing it this way seems very heavy, considering that I'll have millions of invoices and contracts. I feel like there is an easy way with pandas, but I'm not sure how to do it.

Thanks

I was solving a similar problem the other day. It's not a simple solution but should be generic in identifying any non-overlapping intervals.

The idea is to convert your dates into continuous integers (days since a reference date), so that overlaps can be removed with a set union (the | operator). The function below will transform your DataFrame into a dictionary that maps each ID to the set of integer days covered by its date ranges.

import pandas as pd
from functools import reduce

def non_overlapping_intervals(df, uid, date_from, date_to):
    # Work on a copy so the caller's DataFrame does not gain helper columns
    df = df.copy()
    # Convert each date to an integer day count from a fixed reference date
    reference = pd.Timestamp('1900-01-01')
    helper_from = date_from + '_helper'
    helper_to = date_to + '_helper'
    df[helper_from] = df[date_from].sub(reference).dt.days
    df[helper_to] = df[date_to].sub(reference).dt.days

    out = (
        df[[uid, helper_from, helper_to]]
        .dropna()
        .groupby(uid)
        [[helper_from, helper_to]]
        .apply(
            lambda x: reduce(  # Fold the per-row sets of one ID into a single set
                lambda a, b: a | b, x.apply(  # Set union removes days counted more than once
                    lambda y: set(range(int(y[helper_from]), int(y[helper_to]))),  # Day integers in [from, to)
                    axis=1
                )
            )
        )
        .to_dict()
    )
    return out
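
To make the idea concrete, here is a minimal sketch on two made-up overlapping ranges (not taken from the data above). The helper to_day_number is a hypothetical name used only for this illustration; it mirrors the day-count conversion in the function.

```python
import pandas as pd

REFERENCE = pd.Timestamp('1900-01-01')  # same reference date as in the function above


def to_day_number(date_string):
    # Hypothetical helper: whole days elapsed since the reference date
    return (pd.Timestamp(date_string) - REFERENCE).days


# Two overlapping half-open ranges: [Jan 1, Jan 5) and [Jan 3, Jan 8)
first = set(range(to_day_number('2020-01-01'), to_day_number('2020-01-05')))
second = set(range(to_day_number('2020-01-03'), to_day_number('2020-01-08')))

# The union keeps each covered day exactly once
covered = first | second
print(len(first), len(second), len(covered))
```

The two ranges hold 4 and 5 days, but the union holds 7 because Jan 3 and Jan 4 appear in both.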

From here, we want to do a set subtraction to find the dates and IDs for which there are contracts but no invoices:

from collections import defaultdict

invoice_dates = defaultdict(set, non_overlapping_intervals(invoice_df, 'contract_id', 'from', 'to'))
contract_dates = defaultdict(set, non_overlapping_intervals(contract_df, 'contract_id', 'from', 'to'))

missing_dates = {}
for k, v in contract_dates.items():
    missing_dates[k] = list(v - invoice_dates.get(k, set()))
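
As a small illustration of the subtraction step, here is a sketch with hand-written day sets. The IDs 'A' and 'B' and the integers are invented for the example; an ID without invoices simply keeps all of its contract days.

```python
# Day integers covered by contracts, per ID (made-up values)
contract_days = {'A': set(range(0, 10)), 'B': set(range(5, 8))}
# Day integers already invoiced; 'B' has no invoices at all
invoice_days = {'A': set(range(0, 4)) | set(range(6, 10))}

missing = {}
for key, days in contract_days.items():
    # Set difference: contract days that no invoice covers
    missing[key] = sorted(days - invoice_days.get(key, set()))

print(missing)
```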

Now we have a dict called missing_dates that gives us each date for which there is no invoice. To convert it into your output format, we need to split the dates of each ID into runs of consecutive days. Using this answer, we arrive at the below:

from itertools import groupby
from operator import itemgetter

missing_invoices = []
for uid, dates in missing_dates.items():
    for k, g in groupby(enumerate(sorted(dates)), lambda x: x[0] - x[1]):
        group = list(map(int, map(itemgetter(1), g)))
        missing_invoices.append([uid, group[0], group[-1]])
missing_invoices = pd.DataFrame(missing_invoices, columns=['contract_id', 'from', 'to'])

# Convert back to datetime
reference = pd.Timestamp('1900-01-01')
missing_invoices['from'] = reference + pd.to_timedelta(missing_invoices['from'], unit='D')
# 'to' holds the last missing day, so add one day to get an exclusive end date
missing_invoices['to'] = reference + pd.to_timedelta(missing_invoices['to'] + 1, unit='D')
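
The grouping trick used in the loop may be easier to see on a plain list. In a sorted list of integers, consecutive values keep a constant difference between their position and their value, so that difference can serve as the grouping key. A minimal sketch with made-up numbers:

```python
from itertools import groupby

days = [3, 4, 5, 9, 10, 14]  # already sorted

runs = []
# position - value stays constant within a run of consecutive integers
for _, run in groupby(enumerate(days), key=lambda pair: pair[0] - pair[1]):
    values = [value for _, value in run]
    runs.append((values[0], values[-1]))  # first and last day of the run

print(runs)
```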

Probably not the simple solution you were looking for, but this should be reasonably efficient.
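
If holding one integer per day for every contract turns out to be too heavy in memory, one possible alternative is to subtract the intervals directly on their endpoints. The sketch below is not the approach above but a different technique (a sorted sweep per contract), and it still loops in Python. It assumes half-open [from, to) periods and non-overlapping invoices per contract, as described in the question. The names subtract_intervals and periods_to_bill, and the toy IDs 'C1' and 'C2', are made up for this example.

```python
import pandas as pd


def subtract_intervals(contract_periods, invoice_periods):
    """Return the parts of the contract periods not covered by any invoice.

    Both arguments are lists of half-open (start, end) pairs.
    """
    invoices = sorted(invoice_periods)
    remaining = []
    for start, end in sorted(contract_periods):
        cursor = start  # everything before cursor is already handled
        for inv_start, inv_end in invoices:
            if inv_end <= cursor:
                continue  # invoice lies entirely before the cursor
            if inv_start >= end:
                break  # invoice starts after this contract period
            if inv_start > cursor:
                remaining.append((cursor, inv_start))  # unbilled gap
            cursor = max(cursor, inv_end)
            if cursor >= end:
                break
        if cursor < end:
            remaining.append((cursor, end))  # unbilled tail
    # Merge pieces that touch, e.g. across two adjacent contract rows
    merged = []
    for start, end in remaining:
        if merged and merged[-1][1] >= start:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def periods_to_bill(contracts, invoices):
    # Collect the invoice periods of each contract once
    invoices_by_id = {
        cid: list(zip(group['from'], group['to']))
        for cid, group in invoices.groupby('contract_id')
    }
    rows = []
    for cid, group in contracts.groupby('contract_id'):
        periods = list(zip(group['from'], group['to']))
        for start, end in subtract_intervals(periods, invoices_by_id.get(cid, [])):
            rows.append((cid, start, end))
    return pd.DataFrame(rows, columns=['contract_id', 'from', 'to'])


# Toy data shaped like the question's frames
contracts = pd.DataFrame({
    'contract_id': ['C1', 'C1', 'C2'],
    'from': pd.to_datetime(['2018-07-01', '2019-01-01', '2019-10-01']),
    'to': pd.to_datetime(['2019-01-01', '2019-07-01', '2024-01-01']),
})
invoices = pd.DataFrame({
    'contract_id': ['C1', 'C1'],
    'from': pd.to_datetime(['2018-07-01', '2019-05-01']),
    'to': pd.to_datetime(['2019-04-01', '2019-07-01']),
})

to_bill = periods_to_bill(contracts, invoices)
print(to_bill)
```

Here the work per contract depends on the number of rows rather than on the number of days covered, which may matter for contracts spanning many years.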
