Data changes while interpolating data frame using Pandas and numpy
I am trying to calculate degree hours based on hourly temperature values. The data I am using has some missing days, and I am trying to interpolate them. Below is part of the data:
2012-06-27 19:00:00 24
2012-06-27 20:00:00 23
2012-06-27 21:00:00 23
2012-06-27 22:00:00 16
2012-06-27 23:00:00 15
2012-06-29 00:00:00 15
2012-06-29 01:00:00 16
2012-06-29 02:00:00 16
2012-06-29 03:00:00 16
2012-06-29 04:00:00 17
2012-06-29 05:00:00 17
2012-06-29 06:00:00 18
....
2014-12-14 20:00:00 1
2014-12-14 21:00:00 0
2014-12-14 22:00:00 -1
2014-12-14 23:00:00 8
The full code is:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

filename = 'Temperature12.xls'
df_temp = pd.read_excel(filename)
df_temp = df_temp.set_index('datetime')
ts_temp = df_temp['temp']

def inter_lin_nan(ts_temp, rule):
    ts_temp = ts_temp.resample(rule)
    mask = np.isnan(ts_temp)
    # interpolating missing values
    ts_temp[mask] = np.interp(np.flatnonzero(mask), np.flatnonzero(~mask), ts_temp[~mask])
    return(ts_temp)

ts_temp = inter_lin_nan(ts_temp,'1H')
print ts_temp['2014-06-28':'2014-06-29']

def HDH (Tcurr,Tref=15.0):
    if Tref >= Tcurr:
        return ((Tref-Tcurr)/24)
    else:
        return (0)

df_temp['H-Degreehours'] = df_temp.apply(lambda row: HDH(row['temp']),axis=1)
df_temp['CDD-CUMSUM'] = df_temp['C-Degreehours'].cumsum()
df_temp['HDD-CUMSUM'] = df_temp['H-Degreehours'].cumsum()
df_temp1=df_temp['H-Degreehours'].resample('H', how=sum)
print df_temp1
Now I have two questions. First, the inter_lin_nan function does interpolate the data, but it also changes the next day's data, which then differs completely from what is in the Excel file. Is this common, or have I missed something?

Second, at the end of the code I am trying to add up the hourly degree-day values, which is why I created another data frame. But when I print that data frame, it still has NaN values, as in the original data file. Could you please tell me why this is happening? I may be missing something very obvious, as I am new to Python.
Don't use numpy when pandas has its own version.
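import pandas as pd
# assumes the first column holds the timestamps; asfreq needs a DatetimeIndex
df = pd.read_csv(filepath, index_col=0, parse_dates=True)
df = df.asfreq('1d')  # get a time series with an index timestamp for each day
df['somelabel'] = df['somelabel'].interpolate(method='linear')  # interpolate NaN values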
Use asfreq() to add the required frequency of timestamps to your time series, and use interpolate() to interpolate the NaN values only.
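Here is a minimal sketch of the same idea on hourly data, written for a recent pandas release (the question's code is Python 2). The sample series is made up to mirror the gap on 2012-06-28 in the question; the names temp, hourly and filled are only for illustration.

```python
import pandas as pd

# Hypothetical sample: hourly readings with all of 2012-06-28 missing,
# mirroring the gap in the question's data.
index = pd.to_datetime([
    "2012-06-27 22:00", "2012-06-27 23:00",
    "2012-06-29 00:00", "2012-06-29 01:00",
])
temp = pd.Series([16.0, 15.0, 15.0, 16.0], index=index, name="temp")

# asfreq inserts the missing hourly timestamps as NaN; it does not aggregate.
# ("h" is the hourly alias in current pandas; older releases spell it "H".)
hourly = temp.asfreq("h")

# interpolate fills only the NaN entries and leaves existing readings alone.
filled = hourly.interpolate(method="linear")

# Check that the readings present in the input are unchanged.
print((filled.loc[temp.index] == temp).all())
print(filled.isna().sum())
```

Comparing the filled series against the original at the original timestamps, as in the last lines, is a quick way to confirm that the interpolation did not touch the data that was already there.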
http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.Series.interpolate.html
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.asfreq.html
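As for the second question, one likely reason, judging only from the code posted: inter_lin_nan returns a new series (ts_temp), but the degree hours are computed from df_temp['temp'], which still has the gaps, so resampling to hourly at the end brings the missing hours back as NaN. A sketch of one way around that, assuming a recent pandas and a made-up sample frame, is to fill the frame itself before deriving any columns. T_REF and the clip-based formula are my rewrite of the question's HDH function, not something from the original code.

```python
import pandas as pd

# Hypothetical sample with one missing hour (02:00), standing in for the Excel data.
index = pd.to_datetime([
    "2014-12-14 00:00", "2014-12-14 01:00", "2014-12-14 03:00",
])
df_temp = pd.DataFrame({"temp": [17.0, 9.0, 3.0]}, index=index)

# Regularise and fill the frame itself, so later columns are built from filled data.
df_temp = df_temp.asfreq("h")
df_temp["temp"] = df_temp["temp"].interpolate(method="linear")

T_REF = 15.0  # reference temperature, the default of HDH in the question

# Vectorised form of HDH: (T_REF - T) / 24 where T <= T_REF, otherwise 0.
df_temp["H-Degreehours"] = (T_REF - df_temp["temp"]).clip(lower=0) / 24
df_temp["HDD-CUMSUM"] = df_temp["H-Degreehours"].cumsum()

print(df_temp)
```

Because the index is already hourly after asfreq, no further resample is needed, and the cumulative sum has no NaN rows to carry along.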