How to include end date in pandas date_range method?

From pd.date_range('2016-01', '2016-05', freq='M').strftime('%Y-%m'), the last month is 2016-04, but I was expecting it to be 2016-05. It seems to me this function is behaving like the range method, where the end parameter is not included in the returned array.

Is there a way to get the end month included in the returned array, without processing the string for the end month?

Here is a way to do it without having to work out the month ends yourself:

pd.date_range(*(pd.to_datetime(['2016-01', '2016-05']) + pd.offsets.MonthEnd()), freq='M')

DatetimeIndex(['2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30',
           '2016-05-31'],
          dtype='datetime64[ns]', freq='M')
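
A minimal sketch of the same idea wrapped in a helper, in case it is useful. The function name month_labels is hypothetical and not part of pandas. It uses MonthEnd().rollforward() rather than adding MonthEnd(), on the assumption that you may sometimes pass a date that already sits on a month end: adding the offset would push such a date to the next month end, while rollforward leaves it alone. Passing the offset object as freq also avoids depending on the alias string, which recent pandas releases spell 'ME' instead of 'M'.

```python
import pandas as pd


def month_labels(start, end):
    """Return 'YYYY-MM' labels for every month from start to end, inclusive.

    Hypothetical helper for illustration only.
    """
    month_end = pd.offsets.MonthEnd()
    # Move the end bound to the last day of its month (no change if it is
    # already a month end), so the final month-end timestamp is in range.
    end_ts = month_end.rollforward(pd.to_datetime(end))
    idx = pd.date_range(pd.to_datetime(start), end_ts, freq=month_end)
    return list(idx.strftime('%Y-%m'))


labels = month_labels('2016-01', '2016-05')
print(labels)

# Adding the offset to a date that is already a month end skips a month:
shifted = pd.Timestamp('2016-01-31') + pd.offsets.MonthEnd()
print(shifted)
```

With the inputs from the question, labels should run from '2016-01' through '2016-05', and shifted shows why plain addition can drop the first month when the start is already a month end.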

You can use .union to add the next logical value after initializing the date_range. It should work as written for any frequency:

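d = pd.date_range('2016-01', '2016-05', freq='M')
# Step the last value forward by one frequency unit; integer addition
# such as d[-1] + 1 is no longer supported in recent pandas versions.
d = d.union([d[-1] + d.freq]).strftime('%Y-%m')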

Alternatively, you can use period_range instead of date_range. Depending on what you intend to do, this might not be the right thing to use, but it satisfies your question:

pd.period_range('2016-01', '2016-05', freq='M').strftime('%Y-%m')

In either case, the resulting output is as expected:

['2016-01' '2016-02' '2016-03' '2016-04' '2016-05']
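
If you go the period_range route but later need real timestamps again, something along these lines may help. It is only a sketch: the variable names are made up for illustration, and it assumes that month-start timestamps are what you want back.

```python
import pandas as pd

# Each element is a whole-month Period, so the end month is included
# without having to name a particular day.
periods = pd.period_range('2016-01', '2016-05', freq='M')

# Format the periods directly as 'YYYY-MM' strings.
period_labels = list(periods.strftime('%Y-%m'))

# Convert back to timestamps anchored at the start of each month.
month_starts = periods.to_timestamp(how='start')

print(period_labels)
print(month_starts)
```

The point of the design is that a Period stands for the whole month rather than one instant in it, which is why the month-end boundary problem does not come up.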

For those arriving later: you can also try the Month-Start frequency.

>>> pd.date_range('2016-01', '2016-05', freq='MS')
DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01',
               '2016-05-01'],
              dtype='datetime64[ns]', freq='MS')

Include the day when specifying the dates in the date_range call:

pd.date_range('2016-01-31', '2016-05-31', freq='M', ).strftime('%Y-%m')

array(['2016-01', '2016-02', '2016-03', '2016-04', '2016-05'], 
      dtype='|S7')

I had a similar problem when using datetime objects in a DataFrame. I would set the boundaries through the .min() and .max() functions and then fill in missing dates using the pd.date_range function. Unfortunately, the returned list/df was missing the maximum value.

I found two workarounds for this:

1) Add the "closed = None" parameter to the pd.date_range function. This worked in the example below; however, it didn't work for me when working only with dataframes (no idea why).

2) If option #1 doesn't work, you can add one extra unit (in this case a day) using the datetime.timedelta() function. In the case below it over-indexes by a day, but it can work for you if the date_range function isn't giving you the full range.

import pandas as pd
import datetime as dt 

# List of dates as strings
time_series = ['2020-01-01', '2020-01-03', '2020-01-05', '2020-01-06', '2020-01-07']

# Create a DataFrame with the time data converted to datetime objects
raw_data_df = pd.DataFrame(pd.to_datetime(time_series), columns=['Raw_Time_Series'])

# Build indexed_time so that it covers the full time range, including missing dates

# Option 1: pass closed=None.
# Note: closed was deprecated in pandas 1.4 in favor of inclusive and later removed;
# on those versions use inclusive='both' instead.
indexed_time = pd.date_range(start=raw_data_df.Raw_Time_Series.min(),
                             end=raw_data_df.Raw_Time_Series.max(),
                             freq='D', closed=None)
print('indexed_time option #1 = ', indexed_time)

# Option 2: extend the end by one unit (in this case a day)
# using datetime.timedelta to get what you need.
indexed_time = pd.date_range(start=raw_data_df.Raw_Time_Series.min(),
                             end=raw_data_df.Raw_Time_Series.max() + dt.timedelta(days=1),
                             freq='D')
print('indexed_time option #2 = ', indexed_time)

# In this case you over-index by an extra day, because date_range already works properly.
# However, if closed=None doesn't extend through the full range, this is a good workaround.
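
For comparison, here is a small sketch of the same min/max scenario that relies only on the default arguments. It assumes the dates are already parsed and carry no time-of-day component; the variable names are illustrative. Under those assumptions the maximum is part of the result, so neither workaround should be needed.

```python
import pandas as pd

# Observed dates with gaps (2 Jan and 4 Jan are absent).
dates = pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-05',
                        '2020-01-06', '2020-01-07'])

# Daily range between the smallest and largest observed date.
# With default arguments both boundaries are included.
full = pd.date_range(dates.min(), dates.max(), freq='D')

# Dates in the full range that never appear in the observations.
missing = full.difference(dates)

print(full[-1] == dates.max())
print(list(missing.strftime('%Y-%m-%d')))
```

If the maximum still goes missing in your own data, one thing worth checking is whether the values carry a time component: the range steps from the start in whole days, so an end such as 2020-01-07 00:00 is never reached exactly when the start is 2020-01-01 13:00.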

I don't think so. You need to add the (n+1) boundary:

   pd.date_range('2016-01', '2016-06', freq='M' ).strftime('%Y-%m')

The start and end dates are strictly inclusive, so it will not generate any dates outside of those dates if specified. http://pandas.pydata.org/pandas-docs/stable/timeseries.html

Either way, you have to manually add some information. I believe adding just one more month is not a lot of work.

The explanation for this issue is that the function pd.to_datetime() converts a '%Y-%m' date string by default to the first-of-the-month datetime, or '%Y-%m-01':

>>> pd.to_datetime('2016-05')
Timestamp('2016-05-01 00:00:00')
>>> pd.date_range('2016-01', '2016-02')
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
               '2016-01-09', '2016-01-10', '2016-01-11', '2016-01-12',
               '2016-01-13', '2016-01-14', '2016-01-15', '2016-01-16',
               '2016-01-17', '2016-01-18', '2016-01-19', '2016-01-20',
               '2016-01-21', '2016-01-22', '2016-01-23', '2016-01-24',
               '2016-01-25', '2016-01-26', '2016-01-27', '2016-01-28',
               '2016-01-29', '2016-01-30', '2016-01-31', '2016-02-01'],
              dtype='datetime64[ns]', freq='D')

Then everything follows from that. Specifying freq='M' includes the month ends between 2016-01-01 and 2016-05-01, which is the list you receive, and it excludes 2016-05-31. But specifying month starts with 'MS', as the second answer provides, includes 2016-05-01 because it falls within the range. The default behavior of pd.date_range() isn't like the range method, since both ends are included. From the docs:

closed controls whether to include start and end that are on the boundary. The default includes boundary points on either end.
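
The reasoning above can be checked with a short sketch. The variable names are illustrative, and offset objects are passed as freq (MonthEnd for 'M', MonthBegin for 'MS') so the snippet does not depend on which alias spelling your pandas version expects.

```python
import pandas as pd

# '%Y-%m' strings parse to the first day of the month.
start = pd.to_datetime('2016-01')
end = pd.to_datetime('2016-05')

# Daily range: both boundaries are present, unlike Python's range().
days = pd.date_range(start, end)

# Month ends inside [2016-01-01, 2016-05-01]: 2016-05-31 lies outside.
month_ends = pd.date_range(start, end, freq=pd.offsets.MonthEnd())

# Month starts inside the same interval: 2016-05-01 is the boundary itself.
month_starts = pd.date_range(start, end, freq=pd.offsets.MonthBegin())

print(days[0], days[-1])
print(list(month_ends.strftime('%Y-%m')))
print(list(month_starts.strftime('%Y-%m')))
```

The month-end list stops at 2016-04 and the month-start list reaches 2016-05, which matches the behavior described in the question and in the Month-Start answer.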
