How to interpolate values in a data frame using Python and/or R

Question

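I have a dataset that looks like this:

I imported it into a pandas DataFrame with pandas.read_csv, using the "Year" and "Country" columns as the index. What I need to do is change the time step from every 5 years to every year and interpolate the values in between, and I really have no idea how to do that. I am learning both R and Python, so help in either language would be highly appreciated.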

Answer 1

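If you give the DataFrame a DatetimeIndex, you can take advantage of the df.resample and df.interpolate('time') methods.

To make df.index a DatetimeIndex you might be tempted to use set_index('Year'). However, Year by itself is not unique, since it is repeated for each Country. In order to call resample we need a unique index. So use df.pivot instead: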

# convert integer years into `datetime64` values
In [441]: df['Year'] = (df['Year'].astype('i8')-1970).view('datetime64[Y]')

In [442]: df.pivot(index='Year', columns='Country')
Out[442]:
                Avg1                      Avg2
Country    Australia Austria Belgium Australia Austria Belgium
Year
1950-01-01         0       0       0         0       0       0
1955-01-01         1       1       1        10      10      10
1960-01-01         2       2       2        20      20      20
1965-01-01         3       3       3        30      30      30

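You can then use df.resample('A').mean() to resample the data at yearly frequency. Think of resample('A') as chopping df into groups of 1-year intervals. resample returns a DatetimeIndexResampler object whose mean method aggregates the values in each group by taking the mean. Thus mean() returns a DataFrame with one row per year. Since your original df has one data point every 5 years, most of the 1-year groups are empty, so the mean returns NaN for those years. If your data is consistently spaced at 5-year intervals, you could use .first() or .last() instead of .mean(). They would all return the same result.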

In [438]: df.resample('A').mean()
Out[438]:
                Avg1                      Avg2
Country    Australia Austria Belgium Australia Austria Belgium
Year
1950-12-31       0.0     0.0     0.0       0.0     0.0     0.0
1951-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1952-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1953-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1954-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1955-12-31       1.0     1.0     1.0      10.0    10.0    10.0
1956-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1957-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1958-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1959-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1960-12-31       2.0     2.0     2.0      20.0    20.0    20.0
1961-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1962-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1963-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1964-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1965-12-31       3.0     3.0     3.0      30.0    30.0    30.0
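To check the claim that .mean(), .first() and .last() agree when every yearly group holds at most one row, here is a minimal sketch. The one-column frame is made up for illustration, and the 'YS' (year start) alias is my substitution for the 'A' (year end) alias used above, so the row labels fall on January 1 rather than December 31.

```python
import pandas as pd

# Hypothetical data: one observation every 5 years, labelled on January 1.
idx = pd.to_datetime(["1950-01-01", "1955-01-01", "1960-01-01"])
df = pd.DataFrame({"Avg1": [0.0, 1.0, 2.0]}, index=idx)

# 'YS' groups by calendar year and labels each group by its first day.
yearly_mean = df.resample("YS").mean()
yearly_first = df.resample("YS").first()
yearly_last = df.resample("YS").last()

# Years without an observation come back as NaN under all three aggregations.
print(yearly_mean)
```

Because each non-empty group contains exactly one value, the aggregation only decides how the empty groups are represented, and all three methods leave them as NaN.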

Then df.interpolate(method='time') linearly interpolates the missing NaN values based on the nearest non-NaN values and their associated datetime index values.

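import numpy as np
import pandas as pd

# Build a small example frame: one row per (Country, Year), every 5 years.
countries = 'Australia Austria Belgium'.split()
year = np.arange(1950, 1970, 5)
df = pd.DataFrame(
    {'Country': np.repeat(countries, len(year)),
     'Year': np.tile(year, len(countries)),
     'Avg1': np.tile(np.arange(len(year)), len(countries)),
     'Avg2': 10*np.tile(np.arange(len(year)), len(countries))})
# Convert integer years into datetime64 values (January 1 of each year).
# The original used (df['Year'].astype('i8')-1970).view('datetime64[Y]'),
# which relies on Series.view and may not work on recent pandas versions.
df['Year'] = pd.to_datetime(df['Year'].astype(str), format='%Y')
# Pivot so that the index (Year) is unique, with one column per Country.
df = df.pivot(index='Year', columns='Country')

# One row per year; years without data become NaN.
# (Recent pandas versions spell the year-end alias 'YE' instead of 'A'.)
df = df.resample('A').mean()
# Fill the NaNs linearly, weighted by the time between index values.
df = df.interpolate(method='time')

# Return to long format: one row per (Year, Country).
df = df.stack('Country')
df = df.reset_index()
df = df.sort_values(by=['Country', 'Year'])
print(df)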

which yields

         Year    Country      Avg1       Avg2
0  1950-12-31  Australia  0.000000   0.000000
3  1951-12-31  Australia  0.199890   1.998905
6  1952-12-31  Australia  0.400329   4.003286
9  1953-12-31  Australia  0.600219   6.002191
12 1954-12-31  Australia  0.800110   8.001095
15 1955-12-31  Australia  1.000000  10.000000
18 1956-12-31  Australia  1.200328  12.003284
21 1957-12-31  Australia  1.400109  14.001095
...
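The values above are not exact multiples of 0.2 because method='time' weights by the number of days between index values, and the leap year 1952 makes the yearly gaps unequal. A minimal sketch comparing it with the default method='linear', on a made-up one-column series rather than the asker's data:

```python
import pandas as pd

# Hypothetical series: known values at the ends, NaN for the years in between,
# labelled on December 31 like the resampled frame above.
idx = pd.to_datetime([f"{y}-12-31" for y in range(1950, 1956)])
nan = float("nan")
s = pd.Series([0.0, nan, nan, nan, nan, 1.0], index=idx)

by_time = s.interpolate(method="time")        # weights by elapsed days
by_position = s.interpolate(method="linear")  # treats rows as equally spaced

# 365 of the 1826 days have elapsed at the end of 1951, hence about 0.19989.
print(pd.DataFrame({"time": by_time, "linear": by_position}))
```

If evenly spaced yearly steps are what you want, method='linear' ignores the index and gives 0.2, 0.4, and so on.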

Answer 2

This is a tough one, but I think I have it.

Here is an example with a sample DataFrame:

import pandas as pd

df = pd.DataFrame({'country': ['australia', 'australia', 'belgium', 'belgium'],
                   'year': [1980, 1985, 1980, 1985],
                   'data1': [1, 5, 10, 15],
                   'data2': [100, 110, 150, 160]})
df = df.set_index(['country', 'year'])
countries = set(df.index.get_level_values(0))
# Add a row (filled with NaN) for every missing (country, year) pair.
df = df.reindex([(country, year) for country in countries for year in range(1980, 1986)])
# Linearly fill the NaN rows between the known values.
df = df.interpolate()
df = df.reset_index()

For your specific data, assuming every country has data every 5 years between 1950 and 2010 (inclusive):

df = pd.read_csv('path_to_data')
df = df.set_index(['country','year'])
countries = set(df.index.get_level_values(0))
df = df.reindex([(country, year) for country in countries for year in range(1950,2011)])
df = df.interpolate()
df = df.reset_index()

Kind of a tricky problem. I'd be interested to see if someone has a better solution.
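One possible variant, offered as a sketch rather than a definitive answer: df.interpolate() runs straight down each column, so the approach above depends on every country having both its first and last year present. If a country is missing an endpoint, values can leak in from the neighbouring country. Grouping by country before interpolating avoids that. The data below is made up (belgium deliberately starts in 1982), and using MultiIndex.from_product to build the full index is my choice.

```python
import pandas as pd

# Hypothetical data: belgium has no observation for 1980.
df = pd.DataFrame({
    "country": ["australia", "australia", "belgium", "belgium"],
    "year": [1980, 1985, 1982, 1985],
    "data1": [1.0, 5.0, 10.0, 16.0],
})

# Every (country, year) pair for 1980..1985, in sorted country order.
full_index = pd.MultiIndex.from_product(
    [sorted(df["country"].unique()), range(1980, 1986)],
    names=["country", "year"],
)
wide = df.set_index(["country", "year"]).reindex(full_index)

# Interpolating the whole column fills belgium's 1980-81 from australia's 1985.
leaky = wide.interpolate()

# Interpolating within each country leaves those leading gaps as NaN.
per_country = wide.groupby(level="country").transform(lambda s: s.interpolate())

print(per_country.reset_index())
```

Whether leading gaps should stay NaN or be filled some other way depends on the data, so treat this as a starting point.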

Answer 3

First, reindex the frame. Then use df.apply and Series.interpolate.

Something like:

import pandas as pd

df = pd.read_csv(r'folder/file.txt')
rows = df.shape[0]
# Space the existing rows 5 positions apart, then add the positions in between.
df.index = [x for x in range(0, 5*rows, 5)]
df = df.reindex(range(0, 5*rows))
# Interpolate each column; apply returns a new frame, so assign it back.
df = df.apply(pd.Series.interpolate)
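To try this idea without a file, here is a minimal sketch on a made-up, single-country frame with numeric columns only. It also stops the new index at the last observation. The reindex above runs four positions past it, and pandas' default linear interpolation fills those trailing rows with the last known value. The frame and the trimmed range are my assumptions, not part of the answer.

```python
import pandas as pd

# Hypothetical input: one row every 5 years for a single country.
df = pd.DataFrame({"Year": [1950, 1955, 1960], "Avg1": [0.0, 1.0, 2.0]})
rows = df.shape[0]

# Place the existing rows at positions 0, 5, 10.
df.index = range(0, 5 * rows, 5)
# Add the positions in between, stopping at the last observed row.
df = df.reindex(range(0, 5 * (rows - 1) + 1))
# Interpolate each column down the rows and keep the returned frame.
df = df.apply(pd.Series.interpolate)

print(df)
```

Since the Year column is interpolated along with the data, it fills in as 1950, 1951, and so on, stored as floats.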

3 answers

Answer 1
6 2016-06-04 20:29:47

Answer 2
1 2016-06-04 19:59:44

Answer 3
0 2016-06-04 19:44:07
