
Vectorize for-loop which combines two different dataframes

I have a two-column DataFrame that contains the input data. The first column is a start time and the second is a duration in minutes; think of a machine that runs from start to start + duration. I want to use this information to build a 1-D array of length 8760*60, one entry per minute of the year, holding a 1 wherever the machine is running and a 0 otherwise. The following MWE does the job, but the for-loop makes it slow and I don't know how to vectorize it.

import pandas as pd
import numpy as np

# Start and end of time horizon
start = pd.Timestamp(year=2019, month=1, day=1, hour=0, tz='UTC')
end = pd.Timestamp(year=2019, month=12, day=31, hour=23, minute=59, tz='UTC')

# DataFrame of time horizon
dates = pd.DataFrame(pd.date_range(start, end, freq='min'))

# Starting points
t1 = pd.Timestamp(year=2019, month=1, day=2, hour=0, tz='UTC')
t2 = pd.Timestamp(year=2019, month=1, day=1, hour=0, minute=3, tz='UTC')

# Durations
d1 = 5
d2 = 30

# DataFrame from input data
data = pd.DataFrame(
    data=[
        [t1, d1],
        [t2, d2],
        ],
    columns=[
        'start',
        'duration',
        ]
    )

# Array to be filled
on = np.zeros(8760*60)

# Loop over data rows
for idx in data.index:
    # Position of this row's start time in dates
    # (named start_idx so it does not overwrite the Timestamp `start` above)
    start_idx = dates[dates[0] == data.loc[idx, 'start']].index[0]

    # Duration from data
    duration = data.loc[idx, 'duration']

    # Put 1s in the on array from start_idx to start_idx+duration
    on[start_idx: start_idx+duration] = 1

Does this work for you:

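# Minute index for all of 2019, through 23:59 on Dec 31, so len(idx) == 8760*60
idx = pd.date_range(pd.Timestamp('2019-01-01', tz='UTC'),
                    pd.Timestamp('2019-12-31 23:59', tz='UTC'),
                    freq='1min')
df = pd.DataFrame({'on': 0}, index=idx)

def to_mins(row):
    # All minutes during which the machine runs for this row
    return set(pd.date_range(row['start'], periods=row['duration'], freq='1min'))

idx_on = set().union(*data[['start', 'duration']].apply(to_mins, axis='columns'))
# Recent pandas versions do not accept a set as a .loc indexer, so pass a list
df.loc[list(idx_on), 'on'] = 1
on = df.on.values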

If a duration can lead to timestamps outside of 2019, you could use this helper in place of to_mins:

def to_min_range(row):
    return set(
        m 
        for m in pd.date_range(row['start'], periods=row['duration'], freq='1min')
        if m.year == 2019
    )

Another approach aligns the running minutes against the full minute index:

idx = pd.date_range(start, end, freq="min").to_series()
data["duration"] = pd.to_timedelta(data.duration, unit="min")
res = pd.concat(
    [
        i.to_series()
        for i in data.apply(
            lambda row: pd.date_range(
                row["start"],
                row["start"] + row["duration"],
                freq="min",
                inclusive="right",  # spelled closed="right" before pandas 1.4
            ),
            axis=1,
        )
    ]
)
_, res = idx.align(res)
on = (~res.isna()).astype("int").values

This also calls the function passed to apply once per row, i.e. len(data) times; hopefully that is a much smaller number than the number of minutes in a year.
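For comparison, here is a minimal sketch of a variant that avoids apply and the Python-level loop altogether. It is not taken from either answer above: it swaps in a difference-array technique (mark +1 where a run begins and -1 where it ends, then take a cumulative sum), and the names horizon_start, n_minutes, begin, stop and delta are chosen here for illustration only. It assumes start times fall on whole minutes and that durations are integer minutes, as in the question.

```python
import numpy as np
import pandas as pd

# Time horizon: every minute of 2019 in UTC (names here are illustrative)
horizon_start = pd.Timestamp("2019-01-01", tz="UTC")
n_minutes = 8760 * 60

# Same input as in the question
data = pd.DataFrame({
    "start": [pd.Timestamp("2019-01-02 00:00", tz="UTC"),
              pd.Timestamp("2019-01-01 00:03", tz="UTC")],
    "duration": [5, 30],
})

# Offset of each start from the horizon start, in whole minutes
begin = ((data["start"] - horizon_start) // pd.Timedelta(minutes=1)).to_numpy()
stop = begin + data["duration"].to_numpy()

# Clip to the horizon so runs that spill outside 2019 are truncated
begin = np.clip(begin, 0, n_minutes)
stop = np.clip(stop, 0, n_minutes)

# Difference array: +1 where a run begins, -1 just past where it ends.
# np.add.at accumulates correctly even if several runs share an index.
delta = np.zeros(n_minutes + 1, dtype=np.int64)
np.add.at(delta, begin, 1)
np.add.at(delta, stop, -1)

# A positive running sum means at least one run covers that minute
on = (np.cumsum(delta[:-1]) > 0).astype(int)
```

Each run covers the half-open range from begin up to, but not including, stop, which matches the slice on[start: start+duration] in the question. Overlapping runs are counted once because the running sum is only compared against zero.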

EDIT:

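By default, the intervals produced by pd.date_range are closed on both ends, so each one contains duration + 1 minutes. In the comments the OP mentioned that they expect the sum to match the total duration the equipment was on; this can be achieved by leaving one end of the interval open. I kept the right end closed (closed="right", spelled inclusive="right" from pandas 1.4 on), which drops the start minute of each interval.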
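The off-by-one is easy to check on a single interval. The sketch below is only an illustration with made-up variable names (both, half_open); it uses the periods argument rather than closed/inclusive so that it does not depend on the pandas version.

```python
import pandas as pd

start = pd.Timestamp("2019-01-02 00:00", tz="UTC")
duration = 5  # minutes

# Both endpoints included: one minute more than the duration
both = pd.date_range(start, start + pd.Timedelta(minutes=duration), freq="min")

# Fixing the number of periods yields start, ..., start + duration - 1 minutes
half_open = pd.date_range(start, periods=duration, freq="min")

print(len(both), len(half_open))  # 6 5
```

The periods form keeps the start minute and drops the end minute, which is the convention used by the slice in the question's loop.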
