
Automating interpolation of missing values in pandas dataframe

I have a dataframe with airline booking data for the past year for a particular origin and destination. There are hundreds of similar data sets in the system.

Each data set has gaps. In the current example, there are about 85 days of the year for which we have no booking data.

There are two columns here: departure_date and bookings.

My next step is to add the missing dates to the departure_date column and set the corresponding values in the bookings column to NaN.

I am looking for the best way to do this.

Part of the dataFrame is shown below:

Index       departure_date              bookings
0           2017-11-02 00:00:00             43
1           2017-11-03 00:00:00             27
2           2017-11-05 00:00:00             27 ********
3           2017-11-06 00:00:00             22
4           2017-11-07 00:00:00             39
.
.
164         2018-05-22 00:00:00             17
165         2018-05-23 00:00:00             41
166         2018-05-24 00:00:00             73
167         2018-07-02 00:00:00             4  *********
168         2018-07-03 00:00:00             31
.
.
277         2018-10-31 00:00:00             50
278         2018-11-01 00:00:00             60

We can see that the data set covers a one-year period (Nov 2, 2017 to Nov 1, 2018), but we have data for only 279 days. For example, there is no data between 2018-05-25 and 2018-07-01. I need to add these dates to the departure_date column and set the corresponding bookings values to NaN.
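The first step described above (adding the missing dates with NaN bookings) could be sketched roughly as follows. This is a minimal illustration rather than a definitive solution: the five rows are copied from the excerpt, and the names `full_range`, `filled` and `missing_days` are made up for the example.

```python
import pandas as pd

# Five rows copied from the excerpt; 2017-11-04 is absent.
df = pd.DataFrame({
    "departure_date": pd.to_datetime(
        ["2017-11-02", "2017-11-03", "2017-11-05", "2017-11-06", "2017-11-07"]
    ),
    "bookings": [43, 27, 27, 22, 39],
})

# Every calendar day between the first and last departure date.
full_range = pd.date_range(
    df["departure_date"].min(), df["departure_date"].max(), freq="D"
)

# Reindexing on the full range inserts the absent dates with NaN bookings.
filled = (
    df.set_index("departure_date")
      .reindex(full_range)
      .rename_axis("departure_date")
      .reset_index()
)

missing_days = int(filled["bookings"].isna().sum())
print(filled)
print("missing days:", missing_days)
```

Reindexing against an explicit `pd.date_range` also lets you choose fixed start and end dates, which may matter if a data set is missing its first or last days.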

For the second step, I plan to do some interpolation using something like

dataFrame['bookings'].interpolate(method='time', inplace=True)

Please suggest any better alternatives in Python.
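A note on the interpolation call above: `method='time'` weights by elapsed time, so it expects the Series to be indexed by dates rather than by the integer index shown in the excerpt. The sketch below uses two rows from the excerpt plus one inserted NaN day to show how it differs from plain linear interpolation. The variable names are invented for the example, and the result is assigned to a new variable instead of relying on `inplace=True` on a selected column.

```python
import numpy as np
import pandas as pd

# Two rows from the excerpt (2018-05-24 and 2018-07-02) plus one inserted NaN day.
s = pd.Series(
    [73.0, np.nan, 4.0],
    index=pd.to_datetime(["2018-05-24", "2018-05-25", "2018-07-02"]),
    name="bookings",
)

# method="time" uses the dates in the index, so the index must be a DatetimeIndex.
by_time = s.interpolate(method="time")

# method="linear" ignores the index and treats the rows as equally spaced.
by_position = s.interpolate(method="linear")

print(by_time)
print(by_position)
```

Once every missing day has been inserted, so that the index is a regular daily grid, the two methods give the same values.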

This resamples the data to one row per day and then fills the gaps.

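# resample() needs a DatetimeIndex; ffill() is the current spelling of the deprecated pad()
dataFrame.set_index('departure_date')['bookings'].resample('D').ffill()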
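Forward filling repeats the last known value across a gap, which may not be what you want for a gap as long as the one in the question. As a hedged alternative, the sketch below resamples with `asfreq()` so the missing days appear as NaN, then interpolates over time. The helper `fill_daily` and its parameter names are hypothetical; the four rows are taken from the excerpt around the 2018-05-25 to 2018-07-01 gap.

```python
import pandas as pd


def fill_daily(df, date_col="departure_date", value_col="bookings"):
    """Hypothetical helper: one row per day, gaps interpolated over time."""
    daily = (
        df.set_index(date_col)[value_col]
          .resample("D")
          .asfreq()  # days without data appear as NaN
    )
    return daily.interpolate(method="time")


# Rows 165-168 from the excerpt.
df = pd.DataFrame({
    "departure_date": pd.to_datetime(
        ["2018-05-23", "2018-05-24", "2018-07-02", "2018-07-03"]
    ),
    "bookings": [41, 73, 4, 31],
})

interpolated = fill_daily(df)

# For comparison: forward fill repeats 73 for every missing day.
padded = df.set_index("departure_date")["bookings"].resample("D").ffill()

day = pd.Timestamp("2018-06-15")
print(interpolated.loc[day], padded.loc[day])
```

Because the helper takes the column names as arguments, the same function could be applied to each of the similar data sets in a loop.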

You can find more resampler options on this page, so you can pick the one that best fits your needs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
