Data changes while interpolating data frame using Pandas and numpy

I am trying to calculate degree hours from hourly temperature values. The data I am using has some missing days, and I am trying to interpolate them. Below is part of the data:

2012-06-27 19:00:00 24
2012-06-27 20:00:00 23
2012-06-27 21:00:00 23
2012-06-27 22:00:00 16
2012-06-27 23:00:00 15
2012-06-29 00:00:00 15
2012-06-29 01:00:00 16
2012-06-29 02:00:00 16
2012-06-29 03:00:00 16
2012-06-29 04:00:00 17
2012-06-29 05:00:00 17
2012-06-29 06:00:00 18
....
2014-12-14 20:00:00 1
2014-12-14 21:00:00 0
2014-12-14 22:00:00 -1
2014-12-14 23:00:00 8

The full code is:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
filename = 'Temperature12.xls'
df_temp = pd.read_excel(filename)
df_temp = df_temp.set_index('datetime')
ts_temp = df_temp['temp']
def inter_lin_nan(ts_temp, rule):
    ts_temp = ts_temp.resample(rule)
    mask = np.isnan(ts_temp)
    # interpolate missing values
    ts_temp[mask] = np.interp(np.flatnonzero(mask), np.flatnonzero(~mask),ts_temp[~mask])
    return(ts_temp)
ts_temp = inter_lin_nan(ts_temp,'1H')
print ts_temp['2014-06-28':'2014-06-29']
def HDH (Tcurr,Tref=15.0):
    if Tref >= Tcurr:
        return ((Tref-Tcurr)/24)
    else:
        return (0)
df_temp['H-Degreehours'] = df_temp.apply(lambda row: HDH(row['temp']),axis=1)
df_temp['CDD-CUMSUM'] = df_temp['C-Degreehours'].cumsum()
df_temp['HDD-CUMSUM'] = df_temp['H-Degreehours'].cumsum()
df_temp1=df_temp['H-Degreehours'].resample('H', how=sum)
print df_temp1

Now I have two questions. First, the inter_lin_nan function does interpolate the data, but it also changes the next day's data, which then differs completely from what is in the Excel file. Is this common, or have I missed something? Second, at the end of the code I am trying to add up the hourly degree-day values, which is why I created another data frame, but when I print that data frame it still has NaN values, as in the original data file. Could you please tell me why this is happening? I may be missing something very obvious, as I am new to Python.

Don't use numpy when pandas has its own version.

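# asfreq() needs a DatetimeIndex; this assumes the first column holds the timestamps
df = pd.read_csv(filepath, index_col=0, parse_dates=True)
df = df.asfreq('1D')  # get a time series with one index timestamp per day
df['somelabel'] = df['somelabel'].interpolate(method='linear')  # interpolate NaN values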

Use asfreq() to add the required frequency of timestamps to your time series, and use interpolate() to interpolate NaN values only.
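As a minimal sketch of those two calls applied to hourly data like that in the question: the four readings and the `temp` column name are taken from the question, while building the frame inline instead of reading the Excel file, and the `"60min"` frequency string, are assumptions made for illustration.

```python
import pandas as pd

# A few hourly readings from the question; all of 2012-06-28 is missing.
index = pd.to_datetime([
    "2012-06-27 22:00:00",
    "2012-06-27 23:00:00",
    "2012-06-29 00:00:00",
    "2012-06-29 01:00:00",
])
df = pd.DataFrame({"temp": [16.0, 15.0, 15.0, 16.0]}, index=index)

# asfreq() adds a row for every missing hour and leaves its value as NaN;
# rows that already exist keep their measured values.
hourly = df.asfreq("60min")

# interpolate() fills only the NaN rows, so the measured values are untouched.
hourly["temp"] = hourly["temp"].interpolate(method="linear")

print(hourly.loc["2012-06-28"].head())
```

Because only the NaN rows are filled, the readings on either side of the gap should come out exactly as they went in, which is the behaviour the question was expecting.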

http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.Series.interpolate.html

http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.asfreq.html
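On the second question, a likely cause (an inference from the posted code, not something stated above) is that the interpolated values are stored only in `ts_temp`, while the degree-hour column is computed from the unmodified `df_temp`, so resampling it to hourly brings the missing hours back as NaN. The sketch below computes the heating degree hours from the interpolated column instead. It keeps the question's reference temperature of 15.0 and its division by 24, and uses `clip` in place of the row-wise `apply`, which is a substitution rather than the asker's method. The two temperature readings are made up for illustration.

```python
import pandas as pd

# Hypothetical readings with a three-hour gap between them.
index = pd.to_datetime(["2012-06-27 23:00:00", "2012-06-28 03:00:00"])
df = pd.DataFrame({"temp": [15.0, 11.0]}, index=index)

# Fill the gap first, then derive everything else from the filled frame.
hourly = df.asfreq("60min")
hourly["temp"] = hourly["temp"].interpolate(method="linear")

t_ref = 15.0  # reference temperature, as in the question's HDH function

# (t_ref - temp) / 24 where temp is at or below t_ref, otherwise 0.
hourly["H-Degreehours"] = (t_ref - hourly["temp"]).clip(lower=0) / 24
hourly["HDD-CUMSUM"] = hourly["H-Degreehours"].cumsum()

print(hourly)
```

Since every column is derived from the already-interpolated frame, no later resampling step is needed and no NaN values should reappear.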
