
Fill missing timeseries data using pandas or numpy

I have a list of dictionaries that looks like this:

L=[
{
"timeline": "2014-10", 
"total_prescriptions": 17
}, 
{
"timeline": "2014-11", 
"total_prescriptions": 14
}, 
{
"timeline": "2014-12", 
"total_prescriptions": 8
},
{
"timeline": "2015-1", 
"total_prescriptions": 4
}, 
{
"timeline": "2015-3", 
"total_prescriptions": 10
}, 
{
"timeline": "2015-4", 
"total_prescriptions": 3
} 
]

This is the result of a SQL query that, given a start date and an end date, returns the total prescription count for each month from the start month through the end month. However, months where the prescription count is 0 (February 2015 here) are skipped entirely. Is it possible, using pandas or numpy, to alter this list so that it gains an entry for each missing month with 0 as the total prescriptions, as follows:

[
{
"timeline": "2014-10", 
"total_prescriptions": 17
}, 
{
"timeline": "2014-11", 
"total_prescriptions": 14
}, 
{
"timeline": "2014-12", 
"total_prescriptions": 8
},
{
"timeline": "2015-1", 
"total_prescriptions": 4
}, 
{
"timeline": "2015-2",   # 2015-2 to be inserted for missing month
"total_prescriptions": 0 # 0 to be inserted for total prescription
}, 
{
"timeline": "2015-3", 
"total_prescriptions": 10
}, 
{
"timeline": "2015-4", 
"total_prescriptions": 3
} 
]

What you are describing is called "resampling" in pandas. First convert your timeline strings to datetimes and set them as the index:

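import pandas as pd

df = pd.DataFrame(L)
# Parse 'YYYY-M' strings into month-start timestamps and use them as the index
df.index = pd.to_datetime(df.timeline, format='%Y-%m')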
df
           timeline  total_prescriptions
timeline                                
2014-10-01  2014-10                   17
2014-11-01  2014-11                   14
2014-12-01  2014-12                    8
2015-01-01   2015-1                    4
2015-03-01   2015-3                   10
2015-04-01   2015-4                    3

Then you can add the missing months with resample('MS') (MS is the "month start" frequency alias). Select the numeric column, call asfreq() so the new rows appear as NaN, and use fillna(0) to convert those null values to zero as you require:

df = df[['total_prescriptions']].resample('MS').asfreq().fillna(0)
df
            total_prescriptions
timeline                       
2014-10-01                 17.0
2014-11-01                 14.0
2014-12-01                  8.0
2015-01-01                  4.0
2015-02-01                  0.0
2015-03-01                 10.0
2015-04-01                  3.0
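
Note that resample only spans the first to the last month present in the data, so a zero-count month at the very start or end of the query range would still be missing. A minimal sketch of one way around that, assuming the query bounds are known: reindex against an explicit pd.date_range. The names rows, to_month_start, query_start and query_end are illustrative and not part of the original post, and the bounds are simply set to match the sample data.

```python
import pandas as pd

# Sample rows copied from the question (2015-2 is absent).
rows = [
    {"timeline": "2014-10", "total_prescriptions": 17},
    {"timeline": "2014-11", "total_prescriptions": 14},
    {"timeline": "2014-12", "total_prescriptions": 8},
    {"timeline": "2015-1", "total_prescriptions": 4},
    {"timeline": "2015-3", "total_prescriptions": 10},
    {"timeline": "2015-4", "total_prescriptions": 3},
]


def to_month_start(label):
    """Turn a 'YYYY-M' label such as '2015-1' into a month-start Timestamp."""
    year, month = (int(part) for part in label.split("-"))
    return pd.Timestamp(year=year, month=month, day=1)


counts = pd.Series(
    [row["total_prescriptions"] for row in rows],
    index=pd.DatetimeIndex([to_month_start(row["timeline"]) for row in rows]),
    name="total_prescriptions",
)

# Assumed query bounds; here they match the first and last sample rows.
query_start, query_end = "2014-10-01", "2015-04-01"
all_months = pd.date_range(query_start, query_end, freq="MS")

# reindex adds every month in the range that is absent from the data;
# fill_value=0 avoids NaN, so the counts stay integers.
filled = counts.reindex(all_months, fill_value=0)
print(filled)
```

Because no NaN is ever introduced, the column does not get upcast to float, which is the main practical difference from the asfreq/fillna route.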

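To convert back to your original format, turn the datetime index back into strings with strftime, and then export using to_dict('records'). The counts come out as floats because the intermediate NaN forced a float column; add .astype(int) before exporting if you need integers:

df['timeline'] = df.index.strftime('%Y-%m-%d')
df.to_dict('records')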
[{'timeline': '2014-10-01', 'total_prescriptions': 17.0},
 {'timeline': '2014-11-01', 'total_prescriptions': 14.0},
 {'timeline': '2014-12-01', 'total_prescriptions': 8.0},
 {'timeline': '2015-01-01', 'total_prescriptions': 4.0},
 {'timeline': '2015-02-01', 'total_prescriptions': 0.0},
 {'timeline': '2015-03-01', 'total_prescriptions': 10.0},
 {'timeline': '2015-04-01', 'total_prescriptions': 3.0}]
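
The output above uses full dates and floats, whereas the question asked for labels like "2015-2" and integer counts. The steps can be wrapped into one helper that restores that shape. This is only a sketch under assumptions: fill_missing_months is a hypothetical name, the input is assumed to have one row per month with "YYYY-M" labels, and the labels are rebuilt by hand because strftime would zero-pad the month.

```python
import pandas as pd


def fill_missing_months(rows):
    """Return rows with zero-count entries added for any skipped month.

    Assumes each row has a 'timeline' label like '2015-1' and an integer
    'total_prescriptions', with at most one row per month.
    """
    stamps = []
    for row in rows:
        year, month = (int(part) for part in row["timeline"].split("-"))
        stamps.append(pd.Timestamp(year=year, month=month, day=1))

    df = pd.DataFrame(
        {"total_prescriptions": [row["total_prescriptions"] for row in rows]},
        index=pd.DatetimeIndex(stamps),
    ).sort_index()

    # Month-start resampling exposes the gaps as NaN; fill them with 0
    # and cast back to int, since NaN made the column float.
    monthly = df.resample("MS").asfreq().fillna(0).astype(int)

    # Rebuild the unpadded 'YYYY-M' labels used in the question.
    return [
        {"timeline": f"{ts.year}-{ts.month}", "total_prescriptions": int(count)}
        for ts, count in monthly["total_prescriptions"].items()
    ]


rows = [
    {"timeline": "2014-10", "total_prescriptions": 17},
    {"timeline": "2014-11", "total_prescriptions": 14},
    {"timeline": "2014-12", "total_prescriptions": 8},
    {"timeline": "2015-1", "total_prescriptions": 4},
    {"timeline": "2015-3", "total_prescriptions": 10},
    {"timeline": "2015-4", "total_prescriptions": 3},
]

result = fill_missing_months(rows)
for entry in result:
    print(entry)
```

The explicit int(count) conversion keeps NumPy integer types out of the result, which matters if the list is later serialized with the json module.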


 