
Select nearest date first day of month in a python dataframe

I have this kind of dataframe:
[image: enter image description here]

These data represent the value of a consumption index, generally recorded once a month (at the end of the month or at the beginning of the following month) but sometimes more often. The value can be reset to "0" if the counter fails and is replaced. Moreover, for some months no data is available.

I would like to select only one entry per month, but this entry has to be the one nearest to the first day of the month AND before the 15th day of the month (because if the day is later, it could be the end-of-month measurement). Another condition is that if the difference between two consecutive values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.

For example, the output data needs to be:
[image: enter image description here]

The purpose is to calculate a single consumption figure per month.

One solution is to loop over the dataframe (as an array) and apply a series of if conditions. However, I wonder if there is a "simple" alternative to achieve that.

Thank you

You can normalize each date to its month end with MonthEnd, then drop duplicates based on that column so that only the last value of each month is kept.

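from pandas.tseries.offsets import MonthEnd

# Assumes the dates are in a datetime column named 'Index'.
# MonthEnd(0) rolls each date forward to the end of its own month
# (MonthEnd(1) would push a date already on a month end into the next month).
df['New'] = df['Index'] + MonthEnd(0)
df['Diff'] = (df['New'] - df['Index']).dt.days
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New', 'Diff'], axis=1)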

That should do the trick, but I was not able to test it, so please copy and paste the sample data into Stack Overflow if this doesn't do the job.
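
Since the snippet above was posted untested, here is a minimal self-contained sketch of the same idea on made-up data. The column names "Index" and "Value" and the sample readings are assumptions for illustration, and MonthEnd(0) is used so that a date already on a month end stays in its own month.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical sample data; column names "Index" and "Value" are assumptions.
df = pd.DataFrame({
    "Index": pd.to_datetime(["2019-10-05", "2019-10-29", "2019-10-30",
                             "2019-11-04", "2019-11-30", "2020-02-03"]),
    "Value": [1254, 1265, 1277, 1301, 1345, 1541],
})

# MonthEnd(0) rolls each date forward to the end of its own month and
# leaves dates that already fall on a month end unchanged.
df["New"] = df["Index"] + MonthEnd(0)
df["Diff"] = (df["New"] - df["Index"]).dt.days

# Sort so that the row closest to the month end comes first within each
# month, then keep only that row and drop the helper columns.
result = (
    df.sort_values(["New", "Diff"])
      .drop_duplicates(subset="New", keep="first")
      .drop(columns=["New", "Diff"])
      .reset_index(drop=True)
)
print(result)
```

Note that this keeps the reading closest to the end of each month. It does not apply the "before the 15th" or counter-reset conditions from the question.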

Define the dataframe with a datetime index, add helper columns, use the shift method on them to conditionally remove rows, and finally drop the helper columns:

from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np

df = pd.DataFrame([
    [1254],
    [1265],
    [1277],
    [1301],
    [1345],
    [1541]
], columns=["Value"]
, index=[dt.strptime("05-10-19", '%d-%m-%y'),
         dt.strptime("29-10-19", '%d-%m-%y'),
         dt.strptime("30-10-19", '%d-%m-%y'),
         dt.strptime("04-11-19", '%d-%m-%y'),
         dt.strptime("30-11-19", '%d-%m-%y'),
         dt.strptime("03-02-20", '%d-%m-%y')
         ]
)

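# Distance in days of each reading from the nearest month boundary.
# Note: day_offset is built early-rows-first, so it does not follow the row
# order; it is not used by the filter and is dropped again below.
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)  # end of the previous month
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)  # start of the next month
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
# Month number as a string without leading zero, e.g. "10" or "2".
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
# Keep a row unless the rows before and after it are in the same month.
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])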
print(df)

Returns:

            Value
2019-10-05   1254
2019-10-30   1277
2019-11-04   1301
2019-11-30   1345
2020-02-03   1541
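
For the rule as literally stated in the question (one reading per month, taken before the 15th and as close as possible to the 1st, plus any reading where the counter went backwards), one possible reading can be sketched as follows. The function name select_monthly, the column name "Value" and the sample readings are all hypothetical, and because the expected output in the question is only available as an image, this interpretation is an assumption rather than a verified answer.

```python
import numpy as np
import pandas as pd


def select_monthly(df, value_col="Value"):
    """Keep the earliest reading before the 15th of each month,
    plus every reading whose value dropped (counter replaced).
    Assumes a DatetimeIndex; `value_col` is a hypothetical name."""
    df = df.sort_index()

    # True where the value is lower than the previous reading (counter reset).
    reset = (df[value_col].diff() < 0).to_numpy()

    # Candidate rows: readings taken strictly before the 15th.
    early_rows = df[df.index.day < 15]

    # Within each calendar month, the first early row is nearest to the 1st.
    first_in_month = ~early_rows.index.to_period("M").duplicated(keep="first")
    keep_dates = early_rows.index[first_in_month]

    mask = np.asarray(df.index.isin(keep_dates)) | reset
    return df[mask]


# Made-up readings, including one counter replacement on 2019-12-20.
sample = pd.DataFrame(
    {"Value": [1254, 1265, 1277, 1301, 1310, 1345, 12, 60]},
    index=pd.to_datetime([
        "2019-10-05", "2019-10-29", "2019-10-30", "2019-11-04",
        "2019-11-12", "2019-11-30", "2019-12-20", "2020-02-03",
    ]),
)
result = select_monthly(sample)
print(result)
```

Grouping by to_period("M") rather than by month number keeps the same month in different years apart. Months with no reading before the 15th simply produce no row unless a reset occurred in them.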
