
Modifying the date index of pandas dataframe

I am trying to write a highly efficient function that takes an average-sized dataframe (~5000 rows) and returns a year column with the same index: for each date in the index of the original dataframe, the latest year in which the month containing that date lies between a pre-specified start date (st_d) and end date (end_d). My current code decrements the year until the month for a given index date falls within the desired range, but it is really slow: for a dataframe with only 366 entries it takes ~0.2 s. I need it to be at least an order of magnitude faster so that I can apply it repeatedly to tens of thousands of dataframes. I would very much appreciate any suggestions.

import pandas as pd
import numpy as np
import time
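from pandas.tseries.offsets import MonthEnd

# The exception raised below must be defined for the snippet to run.
class BadDataException(Exception):
    pass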

def year_replace(st_d, end_d, x):

    tmp = time.perf_counter()

    def prior_year(d):
        # Look back at most 100 years, which is more than enough.
        for i_t in range(100):
            # The month must have been fully seen in one of the data years.
            t_start = pd.to_datetime(str(d.month) + '/' + str(end_d.year - i_t), format="%m/%Y")
            t_end = t_start + MonthEnd(1)
            if st_d <= t_start <= end_d and st_d <= t_end <= end_d:
                return t_start.year
        # No year in the look-back window contains the whole month.
        raise BadDataException("Not enough data for Gradient Boosted tree.")

    output = pd.Series(index = x.index, data = x.index.map(lambda tt: prior_year(tt)), name = 'year')

    print("time for single dataframe replacement = ", time.perf_counter() - tmp)    

    return output


i = pd.date_range('01-01-2019', '01-01-2020')
x = pd.DataFrame(index = i, data=np.full(len(i), 0))

st_d = pd.to_datetime('01/2016', format="%m/%Y")
end_d = pd.to_datetime('01/2018', format="%m/%Y")
year_replace(st_d, end_d, x)

My advice: avoid loops whenever you can, and check whether a simpler approach is available.

If I understand correctly, what you want is:

For given start and stop timestamps, find the latest timestamp t whose month is taken from the index date and which satisfies start <= t <= stop.

I believe this can be written as follows (I kept your function signature for convenience):

def f(start, stop, x):
    assert start < stop
    tmp = time.perf_counter()
    def y(d):
        # Check current year:
        if start <= d.replace(day=1, year=stop.year) <= stop:
            return stop.year
        # Check previous year:
        if start <= d.replace(day=1, year=stop.year-1) <= stop:
            return stop.year-1
        # Otherwise fail:
        raise TypeError("Ooops")
    # Apply to index:
    df = pd.Series(index=x.index, data=x.index.map(lambda t: y(t)), name='year')
    print("Tick: ", time.perf_counter() - tmp) 
    return df

This runs considerably faster, as requested (by well over an order of magnitude compared with the ~0.2 s you report, although a proper benchmark, e.g. with timeit, would be needed to be sure):

Tick:  0.004744200000004639
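
For a steadier measurement than a single perf_counter reading, the call can be timed with timeit. The sketch below assumes the example data from the question; latest_year is a hypothetical name for the function f above with the timing print removed, and the repeat counts are arbitrary.

```python
import timeit
import pandas as pd

def latest_year(start, stop, index):
    """Two-year check from f above, without the timing print (hypothetical name)."""
    def y(d):
        for year in (stop.year, stop.year - 1):
            if start <= d.replace(day=1, year=year) <= stop:
                return year
        raise ValueError(f"month {d.month} never falls inside the range")
    return pd.Series(index=index, data=index.map(y), name="year")

start = pd.Timestamp("2016-01-01")
stop = pd.Timestamp("2018-01-01")
index = pd.date_range("2019-01-01", "2020-01-01")

# Five rounds of ten calls each; the minimum is the least noisy estimate.
runs = timeit.repeat(lambda: latest_year(start, stop, index), number=10, repeat=5)
best = min(runs) / 10
print(f"best of 5 rounds: {best * 1e3:.3f} ms per call")
```

Taking the minimum over several rounds reduces the influence of other processes on the measurement.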

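There is no need to iterate: checking the current and the previous year is enough. A candidate in the year stop.year - 1 is never later than stop, so going further back only moves the candidate further below start. If both checks fail, no timestamp can satisfy your requirements.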
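
Since the result depends only on the month of each date, one possible further step is to run the two-year check once per calendar month and then look the answer up for the whole index at once. This is a sketch and not part of the original answer: latest_year_by_month is a hypothetical name, and it follows the first-day-of-month check with inclusive bounds used in f above.

```python
import numpy as np
import pandas as pd

def latest_year_by_month(start, stop, index):
    # 13 slots so that month numbers 1..12 can be used directly; -1 means "no fit".
    lookup = np.full(13, -1, dtype=np.int64)
    for month in range(1, 13):
        for year in (stop.year, stop.year - 1):
            if start <= pd.Timestamp(year=year, month=month, day=1) <= stop:
                lookup[month] = year
                break
    # One array lookup replaces the per-row Python function call.
    years = lookup[index.month.to_numpy()]
    if (years < 0).any():
        raise ValueError("some months never fall inside [start, stop]")
    return pd.Series(years, index=index, name="year")

start = pd.Timestamp("2016-01-01")
stop = pd.Timestamp("2018-01-01")
index = pd.date_range("2019-01-01", "2020-01-01")
out = latest_year_by_month(start, stop, index)
```

The Python-level loop now runs at most 24 times regardless of the number of rows, which is the point of the table.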

If the day must be kept, remove day=1 from the replace call (note that replace then raises a ValueError for 29 February when the target year is not a leap year). If the bounds should be exclusive, change the inequalities accordingly. The following function:

def y(d):
    if start < d.replace(year=stop.year) < stop:
        return stop.year
    if start < d.replace(year=stop.year-1) < stop:
        return stop.year-1
    raise TypeError("Ooops")

returns the same result as your function for the example data in the question.
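
The question's code also requires the last day of the month to lie inside the range, which the first version of y does not check. Below is a minimal sketch of that stricter rule, assuming midnight timestamps as in the example; month_year_table and year_replace_fast are hypothetical names, not part of the original code.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

def month_year_table(st_d, end_d):
    """Map month number -> latest year whose whole month lies in [st_d, end_d]."""
    table = {}
    for month in range(1, 13):
        # A month of the previous year always ends before end_d,
        # so two candidate years are enough.
        for year in (end_d.year, end_d.year - 1):
            t_start = pd.Timestamp(year=year, month=month, day=1)
            t_end = t_start + MonthEnd(1)  # last day of that month
            if st_d <= t_start and t_end <= end_d:
                table[month] = year
                break
    return table

def year_replace_fast(st_d, end_d, x):
    table = month_year_table(st_d, end_d)
    months = pd.Series(x.index.month, index=x.index, name="year")
    years = months.map(table)  # months missing from the table become NaN
    if years.isna().any():
        raise ValueError("Not enough data: some months never fit in the range.")
    return years.astype("int64")

i = pd.date_range("2019-01-01", "2020-01-01")
x = pd.DataFrame(index=i, data={"value": 0})
st_d = pd.Timestamp("2016-01-01")
end_d = pd.Timestamp("2018-01-01")
years = year_replace_fast(st_d, end_d, x)
```

With the example range, January 2018 is rejected because it ends after end_d, so every date maps to 2017.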
