Python: Filter DataFrame in Pandas by hour, day and month grouped by year
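As a pandas newcomer, I had to spend a lot of time finding a solution to this problem. Since I still need to handle the boundary cases, I would like to know a better way to solve it.
I have a set of 10-minute "Power" measurements from 2009 to 2012 and want to extract an hour window and a day/month window across all years (i.e. a filter by hour, day and month, grouped by year).
I came up with the following: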
import pandas as pd
import numpy as np
import datetime

dates = pd.date_range(start="08/01/2009", end="08/01/2012", freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])

def filter(df, day, month, hour, daysWindow, hoursWindow):
    """
    Filter a Dataframe by a date window and hour window grouped by years
    @type df: DataFrame
    @param df: DataFrame with dates and values
    @type day: int
    @param day: Day to focus on
    @type month: int
    @param month: Month to focus on
    @type hour: int
    @param hour: Hour to focus on
    @type daysWindow: int
    @param daysWindow: Number of days to perform the days window selection
    @type hoursWindow: int
    @param hoursWindow: Number of hours to perform the hours window selection
    @rtype: DataFrame
    @return: Returns a DataFrame with the
    """
    df_filtered = None
    grouped = df.groupby(lambda x : x.year)
    for year, groupYear in grouped:
        groupedMonthDay = groupYear.groupby(lambda x : (x.month, x.day))
        for monthDay, groupMonthDay in groupedMonthDay:
            if monthDay >= (month,day - daysWindow) and monthDay <= (month,day + daysWindow):
                new_df = groupMonthDay.ix[groupMonthDay.index.indexer_between_time(datetime.time(hour - hoursWindow), datetime.time(hour + hoursWindow))]
                if df_filtered is None:
                    df_filtered = new_df
                else:
                    df_filtered = df_filtered.append(new_df)
    return df_filtered

df_filtered = filter(df,day=8, month=10, hour=8, daysWindow=1, hoursWindow=1)
print(len(df))
print(len(df_filtered))
which returns the following output:
>>>
157825
117
Of course, this code still needs improvement regarding boundary problems, for example when choosing an hour such as 1 with an hoursWindow of 2:
>>> filter(df,day=8, month=10, hour=1, daysWindow=1, hoursWindow=2)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "D:\tmp\test_filtro.py", line 40, in filter
    new_df = groupMonthDay.ix[groupMonthDay.index.indexer_between_time(datetime.time(hour - hoursWindow), datetime.time(hour + hoursWindow))]
ValueError: hour must be in 0..23
A similar problem occurs when choosing days such as 1 or 30.
How can I improve this code?
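Both boundary cases can be reproduced in isolation with the standard library alone. The sketch below is only an illustration and assumes the same comparison logic as the loop above: the hour case fails loudly because `datetime.time` rejects values outside 0..23, while the day case fails silently because the `(month, day)` tuple comparison never reaches into the previous month. `datetime.timedelta` arithmetic, by contrast, rolls over the month boundary.

```python
import datetime

# Hour boundary: hour=1 with hoursWindow=2 asks for datetime.time(-1).
try:
    datetime.time(1 - 2)
    hour_error = None
except ValueError as exc:
    hour_error = str(exc)

# Day boundary: for day=1 with daysWindow=1 the lower bound becomes (10, 0),
# so 30 September, i.e. (9, 30), fails the tuple comparison and is dropped.
sept_30_selected = (9, 30) >= (10, 1 - 1)

# timedelta arithmetic rolls over the month boundary instead.
window_start = datetime.date(2010, 10, 1) - datetime.timedelta(days=1)

print(hour_error)        # the ValueError message raised by datetime.time
print(sept_30_selected)  # False
print(window_start)      # 2010-09-30
```

This suggests building real dates per year and shifting them with `timedelta`, rather than comparing `(month, day)` tuples.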
Updated code for the filter function, which ensures there are no boundary problems:
import pandas as pd
import numpy as np
import datetime

dates = pd.date_range(start="08/01/2009", end="08/01/2012", freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])

def filter(df, day, month, hour, minute=0, daysWindow=1, hoursWindow=1):
    """
    Filter a Dataframe by a date window and hour window grouped by years
    @type df: DataFrame
    @param df: DataFrame with dates and values
    @type day: int
    @param day: Day to focus on
    @type month: int
    @param month: Month to focus on
    @type hour: int
    @param hour: Hour to focus on
    @type minute: int
    @param minute: Minute to focus on
    @type daysWindow: int
    @param daysWindow: Number of days to perform the days window selection
    @type hoursWindow: int
    @param hoursWindow: Number of hours to perform the hours window selection
    @rtype: DataFrame
    @return: Returns a DataFrame with the
    """
    df_filtered = None
    grouped = df.groupby(lambda x : x.year)
    for year, groupYear in grouped:
        # Build the day window with real dates so it rolls over month boundaries
        date = datetime.date(year, month, day)
        dateStart = date - datetime.timedelta(days=daysWindow)
        dateEnd = date + datetime.timedelta(days=daysWindow+1)
        df_filtered_days = df[dateStart:dateEnd]
        # Clamp the hour window to the valid 0..23 range
        timeStart = datetime.time(0 if hour-hoursWindow < 0 else hour-hoursWindow, minute)
        timeEnd = datetime.time(23 if hour+hoursWindow > 23 else hour+hoursWindow, minute)
        # indexer_between_time returns integer positions, hence .iloc
        new_df = df_filtered_days.iloc[df_filtered_days.index.indexer_between_time(timeStart, timeEnd)]
        if df_filtered is None:
            df_filtered = new_df
        else:
            df_filtered = pd.concat([df_filtered, new_df])
    return df_filtered

df_filtered = filter(df,day=8, month=10, hour=1, daysWindow=1, hoursWindow=2)
print(len(df))
print(len(df_filtered))
The output is:
>>>
157825
174
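For comparison, the same selection can also be expressed with boolean masks instead of appending per-year pieces. This is only a sketch under some assumptions, not the code used above: it targets a recent pandas, the name `filter_window` and its snake_case parameters are hypothetical, the hour window is clamped to 0..23 in the same way, and the day window uses an exclusive end at midnight. Because of that exclusive end it returns 171 rows for the call above rather than 174; the difference appears to be the single 00:00 sample of the day after the window, which the inclusive slice picks up once per year.

```python
import numpy as np
import pandas as pd

def filter_window(df, day, month, hour, days_window=1, hours_window=1):
    """Rows within +/- days_window days of (month, day) in every year,
    restricted to hours within +/- hours_window of `hour` (clamped to 0..23).
    Hypothetical helper; assumes df has a DatetimeIndex."""
    idx = df.index
    day_mask = np.zeros(len(idx), dtype=bool)
    for year in idx.year.unique():
        # Note: a (month, day) of 29 February would raise in non-leap years.
        center = pd.Timestamp(year=int(year), month=month, day=day)
        start = center - pd.Timedelta(days=days_window)
        end = center + pd.Timedelta(days=days_window + 1)  # exclusive end
        day_mask |= np.asarray((idx >= start) & (idx < end))
    # Compare on minutes since midnight so the upper bound is inclusive.
    minutes = np.asarray(idx.hour) * 60 + np.asarray(idx.minute)
    lo = max(0, hour - hours_window) * 60
    hi = min(23, hour + hours_window) * 60
    hour_mask = (minutes >= lo) & (minutes <= hi)
    return df[day_mask & hour_mask]

dates = pd.date_range(start="2009-08-01", end="2012-08-01", freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1) * 1500, index=dates, columns=["Power"])

near_midnight = filter_window(df, day=8, month=10, hour=1, days_window=1, hours_window=2)
month_start = filter_window(df, day=1, month=10, hour=8, days_window=1, hours_window=1)
print(len(df))             # 157825
print(len(near_midnight))  # 171
print(len(month_start))    # 117
```

The second call shows that a window centred on the first of the month reaches back to 30 September without any special handling.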