Python linear interpolation of values in dataframe

I have a Python dataframe with hourly values for January 2015, except that some hours are missing entirely, both the index entry and the value. Ideally the dataframe, with columns named "dates" and "values", should have 744 rows. However, 10 hours are missing at random, so it has only 734 rows. I want to interpolate the missing hours of the month to create the desired dataframe with 744 "dates" and 744 "values".

Edit:

I am new to Python, so I am struggling to implement this idea:

  • Create a dataframe whose first column holds all hours in Jan 2015
  • Create a second column of the same size as the first, filled with NaNs
  • Fill the second column with the available values, so that the missing hours keep their NaNs
  • Use the pandas interpolate function
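
The four steps above can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the toy frame `pwr`, its random values, and the choice of which 10 hours to drop are made up, and only interior hours are dropped so that plain linear interpolation is enough. The column names "dates" and "values" come from the question.

```python
import numpy as np
import pandas as pd

hourly = pd.offsets.Hour()  # equivalent to the 'H' frequency alias
all_hours = pd.date_range("2015-01-01", periods=744, freq=hourly)

# Hypothetical stand-in for the original data: 10 interior hours dropped at random.
rng = np.random.default_rng(0)
missing = rng.choice(np.arange(1, 743), size=10, replace=False)
pwr = pd.DataFrame({"dates": all_hours, "values": rng.random(744)})
pwr = pwr.drop(index=missing).reset_index(drop=True)  # 734 rows remain

# Steps 1 and 2: every hour of January 2015, with a NaN placeholder column.
full = pd.DataFrame({"values": np.nan}, index=all_hours)

# Step 3: copy in the available values; the missing hours keep their NaN.
known = pd.DatetimeIndex(pwr["dates"])
full.loc[known, "values"] = pwr["values"].to_numpy()

# Step 4: linear interpolation across the NaN gaps.
full["values"] = full["values"].interpolate(method="linear")

# Back to the two-column layout described in the question.
result = full.rename_axis("dates").reset_index()
```

Here `result` has one row per hour of the month, and the rows that were already present keep their original values.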

Edit2:

I was looking for hints or code snippets. Based on the suggestion below I was able to write the following code, but it fails to fill in the values at the start of the month, which are left as zeros, i.e. hours 1 through 5 on Jan 1.

import pandas as pd
st_dt = '2015-01-01'
en_dt = '2015-01-31'
DateTimeHour = pd.date_range(pd.Timestamp(st_dt).date(), pd.Timestamp(en_dt).date(), freq='H')
Pwr.index = pd.DatetimeIndex(Pwr.index)  # Pwr is the original dataframe
Pwr = Pwr.reindex(DateTimeHour, fill_value=0)
Pwr2 = pd.Series(Pwr.values)
Pwr2.interpolate(limit_direction='both')

What you want requires a combination of this technique: Add missing dates to pandas dataframe

and the pandas function pandas.Series.interpolate. From what you've said, the option 'linear' is what you want.

EDIT:
Interpolation will not work where data points are missing at the very start of the time series. One idea is to use pandas.Series.fillna with 'backfill' after the interpolation. Also, do not set fill_value to 0 when you call reindex.
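
A small sketch of this advice, using a made-up Series `pwr` on an 8-hour index in which the first three hours and one interior hour are missing. The data and names are hypothetical. `bfill()` is used as the equivalent of `fillna` with 'backfill'.

```python
import numpy as np
import pandas as pd

hourly = pd.offsets.Hour()  # equivalent to the 'H' frequency alias
full_index = pd.date_range("2015-01-01 00:00", periods=8, freq=hourly)

# Hypothetical data: hours 00:00-02:00 and 05:00 are missing.
pwr = pd.Series([10.0, 20.0, 40.0, 50.0], index=full_index[[3, 4, 6, 7]])

# With fill_value=0 the gaps become real zeros, so nothing is left to interpolate.
zero_filled = pwr.reindex(full_index, fill_value=0).interpolate()

# Without fill_value the gaps are NaN: interpolate fills the interior gap,
# and bfill then copies the first known value into the leading hours.
filled = pwr.reindex(full_index).interpolate(method="linear").bfill()

print(zero_filled.tolist())
print(filled.tolist())
```

In the zero-filled version the missing hours stay at 0, which matches the symptom described in the question. In the second version the interior gap is interpolated and the leading hours take the first known value.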

Use df.asfreq to expand the DataFrame so that it has an hourly frequency. NaN is inserted for the missing values:

df = df.asfreq('H')

then use df.interpolate to replace the NaNs with (linearly) interpolated values based on the DatetimeIndex and the nearest non-NaN values:

df = df.interpolate(method='time')

For example,

import numpy as np
import pandas as pd

N, M = 744, 734
index = pd.date_range('2015-01-01', periods=N, freq='H')
idx = np.random.choice(np.arange(N), M, replace=False)
idx.sort()
index = index[idx]

# This creates a toy DataFrame with 734 non-null rows:
df = pd.DataFrame({'values': np.random.randint(10, size=(M,))}, index=index)

# This expands the DataFrame to 744 rows (10 null rows):
df = df.asfreq('H')

# This makes `df` have 744 non-null rows:
df = df.interpolate(method='time')
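
One caveat with the example: asfreq only spans the range between the first and last timestamps already in the index, so if the random sample happens to omit the first or last hour of the month, the result has fewer than 744 rows. A sketch of one way to guard against that, assuming the full month is known in advance, is to reindex against an explicit hourly range and fill the ends afterwards. The names `full_index`, `short` and `full` are illustrative, and the toy data deliberately lacks hours at both ends.

```python
import numpy as np
import pandas as pd

hourly = pd.offsets.Hour()  # equivalent to the 'H' frequency alias
full_index = pd.date_range("2015-01-01", periods=744, freq=hourly)

# Toy data that lacks the first 5 and the last 5 hours of the month.
present = full_index[5:-5]
df = pd.DataFrame({"values": np.arange(len(present), dtype=float)}, index=present)

# asfreq does not extend the range, so the frame still has 734 rows.
short = df.asfreq(hourly)

# reindex against the full month gives 744 rows, with NaN at both ends.
full = df.reindex(full_index)

# Interpolate by time, then copy the nearest known value into any NaNs
# that remain at the start or end of the month.
full = full.interpolate(method="time").bfill().ffill()
```

Whether copying the nearest value into the ends is acceptable depends on the data; it is the same backfill idea suggested in the other answer.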

A general interpolation is the following:

If the key exists:

  • Return the value

else:

  • Find the first key before and after the required key, find the distance (which you can define using a desired metric) to both keys, and take a weighted average of the two values, weighted by the distances of the keys (closer means higher weight).
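
A minimal sketch of that procedure in plain Python, assuming sorted numeric keys and using the absolute difference as the distance. The function name `interpolate_at` and the sample data are made up for illustration, and raising an error outside the known range is just one possible choice.

```python
from bisect import bisect_left


def interpolate_at(keys, values, key):
    """Return the value at `key`, interpolating linearly between neighbours.

    `keys` must be sorted in ascending order; `values[i]` belongs to `keys[i]`.
    """
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]  # the key exists: return its value
    if i == 0 or i == len(keys):
        raise ValueError("key lies outside the known range")
    k0, k1 = keys[i - 1], keys[i]  # first key before and after
    d0, d1 = key - k0, k1 - key    # distance to each neighbour
    # Each value is weighted by the distance to the other key,
    # so the closer neighbour contributes more.
    return (values[i - 1] * d1 + values[i] * d0) / (d0 + d1)


hours = [0, 1, 4]
readings = [10.0, 20.0, 50.0]

print(interpolate_at(hours, readings, 1))  # existing key
print(interpolate_at(hours, readings, 2))  # closer to hour 1
print(interpolate_at(hours, readings, 3))  # closer to hour 4
```

With a linear distance like this one, the weighted average is ordinary linear interpolation between the two neighbouring points.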
