Vectorising pandas dataframe apply function for user defined function in python

I want to compute the week of the month for a given date. To do this, I currently use a user-defined function.

Input data frame:

[Image: input data frame]

Output data frame:

[Image: output data frame]

Here is what I have tried:

from math import ceil
def week_of_month(dt):
    """ 
       Returns the week of the month for the specified date.
    """

    first_day = dt.replace(day=1)

    dom = dt.day
    adjusted_dom = dom + first_day.weekday()

    return int(ceil(adjusted_dom/7.0))
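
To make the arithmetic concrete, here is a small worked example of the function above. The sample dates are chosen only for illustration: January 2020 starts on a Wednesday, so `first_day.weekday()` is 2 and the first Monday-based week covers only the 1st through the 5th.

```python
import datetime
from math import ceil


def week_of_month(dt):
    """Returns the week of the month for the specified date."""
    first_day = dt.replace(day=1)
    # Shift the day of month by the weekday of the 1st (Monday == 0)
    adjusted_dom = dt.day + first_day.weekday()
    return int(ceil(adjusted_dom / 7.0))


# Illustrative dates in January 2020 (the 1st is a Wednesday)
examples = [
    datetime.date(2020, 1, 1),   # 1 + 2 = 3   -> ceil(3 / 7)  = 1
    datetime.date(2020, 1, 5),   # 5 + 2 = 7   -> ceil(7 / 7)  = 1
    datetime.date(2020, 1, 6),   # 6 + 2 = 8   -> ceil(8 / 7)  = 2
    datetime.date(2020, 1, 31),  # 31 + 2 = 33 -> ceil(33 / 7) = 5
]
results = [week_of_month(d) for d in examples]
print(results)
```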

After this,

import datetime

import pandas as pd

df = pd.read_csv("input_dataframe.csv")
df.date = pd.to_datetime(df.date)
df['year_of_date'] = df.date.dt.year
df['month_of_date'] = df.date.dt.month
df['day_of_date'] = df.date.dt.day


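wom = pd.Series(dtype=int)

# worker function for creating week of month series
def convert_date(t):
    global wom
    value = week_of_month(datetime.datetime(t.iloc[0], t.iloc[1], t.iloc[2]))
    # Series.append was removed in pandas 2.0; pd.concat does the same (slow) row-by-row growth
    wom = pd.concat([wom, pd.Series([value])], ignore_index=True)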

# calling worker function for each row of dataframe
_ = df[['year_of_date','month_of_date','day_of_date']].apply(convert_date, axis = 1)

# adding new computed column to dataframe
df['week_of_month'] = wom
# here this updated dataframe should look like Output data frame.

For each row of the data frame, this computes the week of the month using the function above. It gets slower as the data frame grows, and I currently have more than 10M rows.

I am looking for a faster way of doing this. What changes can I make to this code to vectorize this operation across all rows?

Thanks in advance.

Edit: After reading the answers, this is the code that worked for me:

first_day_of_month = pd.to_datetime(df.date.values.astype('datetime64[M]'))
df['week_of_month'] = np.ceil((df.date.dt.day + first_day_of_month.weekday) / 7.0).astype(int)

The week_of_month function can be vectorized. It may help to skip the conversion to Python datetime objects entirely and use pandas-only methods instead.

first_day_of_month = df.date.dt.to_period("M").dt.to_timestamp()
df["week_of_month"] = np.ceil((df.date.dt.day + first_day_of_month.dt.weekday) / 7.0).astype(int)
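
To check that a vectorized expression like this agrees with the original row-wise function, a minimal sketch could compare both on a small date range. The helper name `week_of_month_vectorized` and the sample dates are made up for illustration, and the `.dt` accessors assume the column is already a datetime Series.

```python
from math import ceil

import numpy as np
import pandas as pd


def week_of_month(dt):
    """Scalar reference implementation from the question."""
    first_day = dt.replace(day=1)
    return int(ceil((dt.day + first_day.weekday()) / 7.0))


def week_of_month_vectorized(dates: pd.Series) -> pd.Series:
    """Hypothetical helper: the same formula applied to a whole datetime Series."""
    # First day of each date's month, as timestamps
    first_day = dates.dt.to_period("M").dt.to_timestamp()
    return np.ceil((dates.dt.day + first_day.dt.weekday) / 7.0).astype(int)


# Illustrative data: every day in the first quarter of 2020
dates = pd.Series(pd.date_range("2020-01-01", "2020-03-31", freq="D"))

vectorized = week_of_month_vectorized(dates)
scalar = dates.apply(week_of_month)

# Compare element-wise rather than with .equals() to ignore integer width differences
all_match = bool((vectorized == scalar).all())
print(all_match)
```

The vectorized version never leaves pandas/NumPy, so it avoids calling a Python function once per row.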

Right off the bat, without even going into your code or raising X/Y-problem concerns:
try working from a list of unique dates. With 10M rows, I'm sure many of the dates are duplicates.

Steps:

  1. create a 2nd df that contains only the columns you need and no duplicates (drop_duplicates)
  2. run your function on the small dataframe
  3. merge the large and small dfs
  4. (optional) drop the small one
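
The steps above can be sketched roughly as follows. This assumes the column is named `date` as in the question; the tiny sample frame and the variable names `lookup` and `result` are made up for illustration.

```python
from math import ceil

import pandas as pd


def week_of_month(dt):
    """Scalar function from the question."""
    first_day = dt.replace(day=1)
    return int(ceil((dt.day + first_day.weekday()) / 7.0))


# Illustrative input with repeated dates
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-06", "2020-01-01", "2020-01-31", "2020-01-06"]
    )
})

# 1. Small frame holding only the needed column, without duplicates
lookup = df[["date"]].drop_duplicates()

# 2. Run the (slow) function once per unique date
lookup = lookup.assign(week_of_month=lookup["date"].apply(week_of_month))

# 3. Merge the result back onto the large frame; a left merge keeps its row order
result = df.merge(lookup, on="date", how="left")

# 4. (optional) drop the small frame
del lookup

print(result)
```

The saving depends on how many distinct dates there are: the function runs once per unique date instead of once per row.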
