Interpolation of missing values, not NA
I want to linearly interpolate my data, but the missing values are not marked as NA: the rows are simply absent.
Here is my data, which has many missing rows.
timestamp | id | strength
---|---|---
1383260400000 | 1 | -0.3803901328171995
1383261000000 | 1 | -0.42196042219455937
1383265200000 | 1 | -0.460714706261982
My expected output:
timestamp | id | strength
---|---|---
1383260400000 | 1 | -0.3803901328171995
1383261000000 | 1 | -0.42196042219455937
1383261600000 | 1 | Linear interpolated data
1383262200000 | 1 | Linear interpolated data
1383262800000 | 1 | Linear interpolated data
1383263400000 | 1 | Linear interpolated data
1383264000000 | 1 | Linear interpolated data
1383264600000 | 1 | Linear interpolated data
1383265200000 | 1 | -0.460714706261982
The timestamps start at 1383260400000 and end at 1383343800000, and the other ids (ids run from 1 to 2025) have the same issue.
The idea is to convert the timestamps to datetimes and set them as a DatetimeIndex. Then, for each id, a lambda function adds the missing datetimes with Series.asfreq and fills them with interpolate:
import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
# per id: insert the missing 10-minute rows as NaN, then fill them linearly
f = lambda x: x.asfreq('10min').interpolate()
df = df.set_index('timestamp').groupby('id')['strength'].apply(f).reset_index()
print(df)
id timestamp strength
0 1 2013-10-31 23:00:00 -0.380390
1 1 2013-10-31 23:10:00 -0.421960
2 1 2013-10-31 23:20:00 -0.427497
3 1 2013-10-31 23:30:00 -0.433033
4 1 2013-10-31 23:40:00 -0.438569
5 1 2013-10-31 23:50:00 -0.444106
6 1 2013-11-01 00:00:00 -0.449642
7 1 2013-11-01 00:10:00 -0.455178
8 1 2013-11-01 00:20:00 -0.460715
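As a cross-check on the numbers above, the same values can be reproduced for a single id with plain NumPy. This is only a sketch: the names `known_t`, `known_v`, `step_ms`, `grid` and `filled` are made up for illustration, and it assumes one id with rows exactly 10 minutes apart.

```python
import numpy as np

# Known observations from the question (timestamps in ms since the epoch).
known_t = np.array([1383260400000, 1383261000000, 1383265200000], dtype=np.int64)
known_v = np.array([-0.3803901328171995, -0.42196042219455937, -0.460714706261982])

# Full 10-minute grid between the first and the last observation.
step_ms = 10 * 60 * 1000
grid = np.arange(known_t[0], known_t[-1] + step_ms, step_ms)

# Linear interpolation of strength onto the full grid.
filled = np.interp(grid, known_t, known_v)
for t, v in zip(grid, filled):
    print(t, round(float(v), 6))
```

The gap between the second and third observation spans seven 10-minute steps, so each filled row moves by one seventh of the difference between the two known values.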
Finally, if you need the timestamps back in their original format (milliseconds since the epoch):
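import numpy as np

# assuming the column is datetime64[ns]: the integers are nanoseconds, so divide down to milliseconds
df['timestamp'] = df['timestamp'].astype(np.int64) // 1000000
print(df)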
id timestamp strength
0 1 1383260400000 -0.380390
1 1 1383261000000 -0.421960
2 1 1383261600000 -0.427497
3 1 1383262200000 -0.433033
4 1 1383262800000 -0.438569
5 1 1383263400000 -0.444106
6 1 1383264000000 -0.449642
7 1 1383264600000 -0.455178
8 1 1383265200000 -0.460715
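The integer division by 1000000 relies on the datetimes being stored in nanoseconds. A variant that does not depend on the stored resolution subtracts the epoch and divides by a one-millisecond Timedelta. The sketch below uses a small stand-alone Series (`ts`, with `ms` and `epoch_ms` as illustrative names) rather than the answer's DataFrame:

```python
import pandas as pd

# Illustrative data: the three timestamps from the question.
ms = [1383260400000, 1383261000000, 1383265200000]
ts = pd.Series(pd.to_datetime(ms, unit='ms'))

# Elapsed time since the epoch, expressed as a whole number of milliseconds.
epoch_ms = (ts - pd.Timestamp('1970-01-01')) // pd.Timedelta('1ms')
print(epoch_ms.tolist())
```

Applied to the answer's DataFrame, the same expression would be used with `df['timestamp']` in place of `ts`.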
EDIT:
# data from the question
df = pd.DataFrame({'timestamp': [1383260400000, 1383261000000, 1383265200000],
                   'id': [1, 1, 1],
                   'strength': [-0.3803901328171995, -0.4219604221945593, -0.460714706261982]})
print(df)
timestamp id strength
0 1383260400000 1 -0.380390
1 1383261000000 1 -0.421960
2 1383265200000 1 -0.460715
This solution builds the full range of datetimes with date_range, pairs it with every id in a MultiIndex, adds the missing rows with DataFrame.reindex, and finally interpolates per id:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
# all 10-minute datetimes between the overall start and end
r = pd.date_range(pd.to_datetime(1383260400000, unit='ms'),
                  pd.to_datetime(1383343800000, unit='ms'),
                  freq='10min')
ids = df['id'].unique()
# every (timestamp, id) combination; pairs missing from df become NaN rows
mux = pd.MultiIndex.from_product([r, ids], names=['timestamp', 'id'])
f = lambda x: x.interpolate()
df = (df.set_index(['timestamp', 'id'])
        .reindex(mux)
        .groupby('id')['strength']
        .transform(f)
        .reset_index())
print(df)
timestamp id strength
0 2013-10-31 23:00:00 1 -0.380390
1 2013-10-31 23:10:00 1 -0.421960
2 2013-10-31 23:20:00 1 -0.427497
3 2013-10-31 23:30:00 1 -0.433033
4 2013-10-31 23:40:00 1 -0.438569
.. ... .. ...
135 2013-11-01 21:30:00 1 -0.460715
136 2013-11-01 21:40:00 1 -0.460715
137 2013-11-01 21:50:00 1 -0.460715
138 2013-11-01 22:00:00 1 -0.460715
139 2013-11-01 22:10:00 1 -0.460715
[140 rows x 3 columns]
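In the output above, every row after the last observation is filled with -0.460715, because interpolate by default also fills trailing NaN values with the last valid one. If those trailing rows should stay empty, one option is to pass `limit_area='inside'` to `Series.interpolate`. The following is a sketch under that assumption, not part of the original answer. It rebuilds the sample data so that it runs on its own, and `out` is an illustrative name:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'timestamp': [1383260400000, 1383261000000, 1383265200000],
    'id': [1, 1, 1],
    'strength': [-0.3803901328171995, -0.4219604221945593, -0.460714706261982],
})
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')

# All 10-minute datetimes between the overall start and end, for every id.
r = pd.date_range(pd.to_datetime(1383260400000, unit='ms'),
                  pd.to_datetime(1383343800000, unit='ms'),
                  freq='10min')
mux = pd.MultiIndex.from_product([r, df['id'].unique()], names=['timestamp', 'id'])

# limit_area='inside' fills only NaNs that lie between two valid values.
out = (df.set_index(['timestamp', 'id'])
         .reindex(mux)
         .groupby('id')['strength']
         .transform(lambda s: s.interpolate(limit_area='inside'))
         .reset_index())
print(out.head(10))
```

With this variant, only the gaps between known observations are filled. Rows before the first or after the last observation of an id remain NaN.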