Slow pandas DataFrame MultiIndex reindex
I have a pandas DataFrame of the form:
id start_time sequence_no value
0 71 2018-10-17 20:12:43+00:00 114428 3
1 71 2018-10-17 20:12:43+00:00 114429 3
2 71 2018-10-17 20:12:43+00:00 114431 79
3 71 2019-11-06 00:51:14+00:00 216009 100
4 71 2019-11-06 00:51:14+00:00 216011 150
5 71 2019-11-06 00:51:14+00:00 216013 180
6 92 2019-12-01 00:51:14+00:00 114430 19
7 92 2019-12-01 00:51:14+00:00 114433 79
8 92 2019-12-01 00:51:14+00:00 114434 100
What I'm trying to do is fill in the missing sequence_no values per id / start_time combo. For example, the id / start_time pairing of 71 and 2018-10-17 20:12:43+00:00 is missing sequence_no 114430. For each missing sequence_no that gets added, I also need to average/interpolate the missing value column. So the final result of processing the above data would look like:
id start_time sequence_no value
0 71 2018-10-17 20:12:43+00:00 114428 3
1 71 2018-10-17 20:12:43+00:00 114429 3
2 71 2018-10-17 20:12:43+00:00 114430 41 **
3 71 2018-10-17 20:12:43+00:00 114431 79
4 71 2019-11-06 00:51:14+00:00 216009 100
5 71 2019-11-06 00:51:14+00:00 216010 125 **
6 71 2019-11-06 00:51:14+00:00 216011 150
7 71 2019-11-06 00:51:14+00:00 216012 165 **
8 71 2019-11-06 00:51:14+00:00 216013 180
9 92 2019-12-01 00:51:14+00:00 114430 19
10 92 2019-12-01 00:51:14+00:00 114431 39 **
11 92 2019-12-01 00:51:14+00:00 114432 59 **
12 92 2019-12-01 00:51:14+00:00 114433 79
13 92 2019-12-01 00:51:14+00:00 114434 100
(** added to the right of newly inserted rows for easier readability)
My original solution for doing this relied heavily on Python loops over a large table of data, so it seemed like the ideal place for numpy and pandas to shine. Leaning on SO answers like "Pandas: create rows to fill numeric gaps", I came up with:
import pandas as pd
import numpy as np
# Generate dummy data
df = pd.DataFrame([
    (71, '2018-10-17 20:12:43+00:00', 114428, 3),
    (71, '2018-10-17 20:12:43+00:00', 114429, 3),
    (71, '2018-10-17 20:12:43+00:00', 114431, 79),
    (71, '2019-11-06 00:51:14+00:00', 216009, 100),
    (71, '2019-11-06 00:51:14+00:00', 216011, 150),
    (71, '2019-11-06 00:51:14+00:00', 216013, 180),
    (92, '2019-12-01 00:51:14+00:00', 114430, 19),
    (92, '2019-12-01 00:51:14+00:00', 114433, 79),
    (92, '2019-12-01 00:51:14+00:00', 114434, 100),
], columns=['id', 'start_time', 'sequence_no', 'value'])
# Create a new DataFrame with the min/max `sequence_no` values for each `id`/`start_time` pairing
by_start = df.groupby(['start_time', 'id'])
ranges = by_start.agg(
    sequence_min=('sequence_no', 'min'), sequence_max=('sequence_no', 'max')
)
reset = ranges.reset_index()
mins = reset['sequence_min']
maxes = reset['sequence_max']
# Use those min/max values to generate a sequence with ALL values in that range
expanded = pd.DataFrame(dict(
    start_time=reset['start_time'].repeat(maxes - mins + 1),
    id=reset['id'].repeat(maxes - mins + 1),
    # lo/hi avoid shadowing the mins/maxes Series inside the comprehension
    sequence_no=np.concatenate([np.arange(lo, hi + 1) for lo, hi in zip(mins, maxes)])
))
# Use the above generated DataFrame as an index to generate the missing rows, then interpolate
expanded_index = pd.MultiIndex.from_frame(expanded)
result = df.set_index(
    ['start_time', 'id', 'sequence_no']
).reindex(expanded_index).interpolate()
The output is correct, but it runs at almost exactly the same speed as my lots-of-Python-loops solution. I'm sure there are places I could cut out a few steps, but the slowest part in my testing appears to be the reindex. Given that the real-world data consists of almost a million rows (operated on frequently), are there any obvious ways to gain a performance advantage over what I've already written? Any ways I can speed up this transformation?
Combining the merge solution from this answer with the original construction of the expanded DataFrame yields the fastest results so far, when tested on a sufficiently large dataset:
import pandas as pd
import numpy as np
# Generate dummy data
df = pd.DataFrame([
    (71, '2018-10-17 20:12:43+00:00', 114428, 3),
    (71, '2018-10-17 20:12:43+00:00', 114429, 3),
    (71, '2018-10-17 20:12:43+00:00', 114431, 79),
    (71, '2019-11-06 00:51:14+00:00', 216009, 100),
    (71, '2019-11-06 00:51:14+00:00', 216011, 150),
    (71, '2019-11-06 00:51:14+00:00', 216013, 180),
    (92, '2019-12-01 00:51:14+00:00', 114430, 19),
    (92, '2019-12-01 00:51:14+00:00', 114433, 79),
    (92, '2019-12-01 00:51:14+00:00', 114434, 100),
], columns=['id', 'start_time', 'sequence_no', 'value'])
# Create a ranges df with groupby and agg
ranges = df.groupby(['start_time', 'id'])['sequence_no'].agg([
    ('sequence_min', 'min'), ('sequence_max', 'max')
])
reset = ranges.reset_index()
mins = reset['sequence_min']
maxes = reset['sequence_max']
# Use those min/max values to generate a sequence with ALL values in that range
expanded = pd.DataFrame(dict(
    start_time=reset['start_time'].repeat(maxes - mins + 1),
    id=reset['id'].repeat(maxes - mins + 1),
    # lo/hi avoid shadowing the mins/maxes Series inside the comprehension
    sequence_no=np.concatenate([np.arange(lo, hi + 1) for lo, hi in zip(mins, maxes)])
))
# Merge expanded and df; rows missing from df get NaN in `value`
merge = expanded.merge(df, on=['start_time', 'id', 'sequence_no'], how='left')
# Interpolate and assign values
merge['value'] = merge['value'].interpolate()
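To check the "sufficiently large dataset" claim on your own machine, a rough timing harness along these lines may help. It is only a sketch: make_data, expand, via_reindex and via_merge are hypothetical helper names, and the synthetic data (200 groups of 50 sequence numbers, about 30% of interior rows dropped) is an assumption, not the real workload.

```python
import time

import numpy as np
import pandas as pd

KEYS = ["start_time", "id", "sequence_no"]


def make_data(n_groups=200, group_len=50, drop_frac=0.3, seed=0):
    """Build a hypothetical gappy frame shaped like the one in the question."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2019-01-01", periods=n_groups, freq="D", tz="UTC")
    full = pd.DataFrame({
        "id": np.repeat(np.arange(n_groups), group_len),
        "start_time": dates.repeat(group_len),
        "sequence_no": np.tile(np.arange(group_len), n_groups),
    })
    full["value"] = rng.integers(0, 100, len(full)).astype(float)
    # Only drop interior rows so every group keeps its first and last sequence_no.
    interior = (full["sequence_no"] > 0) & (full["sequence_no"] < group_len - 1)
    drop = interior & (rng.random(len(full)) < drop_frac)
    return full.loc[~drop].reset_index(drop=True)


def expand(df):
    """Every sequence_no between each group's min and max, one row per value."""
    reset = (
        df.groupby(["start_time", "id"])["sequence_no"]
        .agg(["min", "max"])
        .reset_index()
    )
    counts = reset["max"] - reset["min"] + 1
    return pd.DataFrame({
        "start_time": reset["start_time"].repeat(counts).reset_index(drop=True),
        "id": reset["id"].repeat(counts).reset_index(drop=True),
        "sequence_no": np.concatenate(
            [np.arange(lo, hi + 1) for lo, hi in zip(reset["min"], reset["max"])]
        ),
    })


def via_reindex(df, expanded):
    """The question's approach: MultiIndex reindex, then interpolate."""
    out = df.set_index(KEYS).reindex(pd.MultiIndex.from_frame(expanded[KEYS]))
    return out["value"].interpolate().to_numpy()


def via_merge(df, expanded):
    """The answer's approach: left merge, then interpolate."""
    out = expanded.merge(df, on=KEYS, how="left")
    return out["value"].interpolate().to_numpy()


df = make_data()
expanded = expand(df)
for name, fn in [("reindex", via_reindex), ("merge", via_merge)]:
    t0 = time.perf_counter()
    fn(df, expanded)
    print(f"{name}: {time.perf_counter() - t0:.4f}s")
```

Raising n_groups and group_len toward the real row count should give a more representative comparison; a single run per method is noisy, so repeating each call a few times is advisable.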
Using merge instead of reindex may speed things up. Using map instead of the list comprehension may help as well.
import pandas as pd
import numpy as np
# Generate dummy data
df = pd.DataFrame([
    (71, '2018-10-17 20:12:43+00:00', 114428, 3),
    (71, '2018-10-17 20:12:43+00:00', 114429, 3),
    (71, '2018-10-17 20:12:43+00:00', 114431, 79),
    (71, '2019-11-06 00:51:14+00:00', 216009, 100),
    (71, '2019-11-06 00:51:14+00:00', 216011, 150),
    (71, '2019-11-06 00:51:14+00:00', 216013, 180),
    (92, '2019-12-01 00:51:14+00:00', 114430, 19),
    (92, '2019-12-01 00:51:14+00:00', 114433, 79),
    (92, '2019-12-01 00:51:14+00:00', 114434, 100),
], columns=['id', 'start_time', 'sequence_no', 'value'])
# Create a ranges df with groupby and agg
ranges = df.groupby(['start_time', 'id'])['sequence_no'].agg([('sequence_min', 'min'), ('sequence_max', 'max')])
# Map with range to create the full sequence number range for each group
ranges['sequence_no'] = list(map(lambda x, y: range(x, y), ranges.pop('sequence_min'), ranges.pop('sequence_max') + 1))
# Explode the DataFrame so each sequence number gets its own row
new_df = ranges.explode('sequence_no')
# Merge new_df and df
merge = new_df.reset_index().merge(df, on=['start_time', 'id', 'sequence_no'], how='left')
# Interpolate and assign values
merge['value'] = merge['value'].interpolate()
start_time id sequence_no value
0 2018-10-17 20:12:43+00:00 71 114428 3.0
1 2018-10-17 20:12:43+00:00 71 114429 3.0
2 2018-10-17 20:12:43+00:00 71 114430 41.0
3 2018-10-17 20:12:43+00:00 71 114431 79.0
4 2019-11-06 00:51:14+00:00 71 216009 100.0
5 2019-11-06 00:51:14+00:00 71 216010 125.0
6 2019-11-06 00:51:14+00:00 71 216011 150.0
7 2019-11-06 00:51:14+00:00 71 216012 165.0
8 2019-11-06 00:51:14+00:00 71 216013 180.0
9 2019-12-01 00:51:14+00:00 92 114430 19.0
10 2019-12-01 00:51:14+00:00 92 114431 39.0
11 2019-12-01 00:51:14+00:00 92 114432 59.0
12 2019-12-01 00:51:14+00:00 92 114433 79.0
13 2019-12-01 00:51:14+00:00 92 114434 100.0
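Both the list comprehension and the map call still run a Python-level loop once per group. As a sketch only (this is not from the answers here, and whether it is faster on the real data is untested), the same sequence numbers can be built entirely in NumPy with repeat and cumsum. expand_ranges is a hypothetical helper name.

```python
import numpy as np


def expand_ranges(mins, maxes):
    """Concatenate arange(lo, hi + 1) for every (lo, hi) pair without a Python loop."""
    mins = np.asarray(mins)
    counts = np.asarray(maxes) - mins + 1
    # Position in the flat output at which each group starts.
    offsets = np.cumsum(counts) - counts
    # A global counter minus the group's start gives 0, 1, 2, ... within each group.
    within = np.arange(counts.sum()) - np.repeat(offsets, counts)
    return within + np.repeat(mins, counts)


# The three min/max pairs from the example data
seq = expand_ranges([114428, 216009, 114430], [114431, 216013, 114434])
print(seq)
```

The result could stand in for the np.concatenate([...]) expression when building the expanded DataFrame, with mins and maxes passed in as the sequence_min and sequence_max columns.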
A shorter version of the merge solution:
df.groupby(['start_time', 'id'])['sequence_no']\
.apply(lambda x: np.arange(x.min(), x.max() + 1))\
.explode().reset_index()\
.merge(df, on=['start_time', 'id', 'sequence_no'], how='left')\
.interpolate()
Output:
start_time id sequence_no value
0 2018-10-17 20:12:43+00:00 71 114428 3.0
1 2018-10-17 20:12:43+00:00 71 114429 3.0
2 2018-10-17 20:12:43+00:00 71 114430 41.0
3 2018-10-17 20:12:43+00:00 71 114431 79.0
4 2019-11-06 00:51:14+00:00 71 216009 100.0
5 2019-11-06 00:51:14+00:00 71 216010 125.0
6 2019-11-06 00:51:14+00:00 71 216011 150.0
7 2019-11-06 00:51:14+00:00 71 216012 165.0
8 2019-11-06 00:51:14+00:00 71 216013 180.0
9 2019-12-01 00:51:14+00:00 92 114430 19.0
10 2019-12-01 00:51:14+00:00 92 114431 39.0
11 2019-12-01 00:51:14+00:00 92 114432 59.0
12 2019-12-01 00:51:14+00:00 92 114433 79.0
13 2019-12-01 00:51:14+00:00 92 114434 100.0
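One caveat with the chained version, as far as I can tell: explode leaves sequence_no as object dtype, and calling interpolate on the whole frame also touches the string start_time column, which newer pandas releases may warn about. A hedged variant, assuming the same example data, casts sequence_no back to an integer and interpolates only the value column. The names keys and full are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    (71, '2018-10-17 20:12:43+00:00', 114428, 3),
    (71, '2018-10-17 20:12:43+00:00', 114429, 3),
    (71, '2018-10-17 20:12:43+00:00', 114431, 79),
    (71, '2019-11-06 00:51:14+00:00', 216009, 100),
    (71, '2019-11-06 00:51:14+00:00', 216011, 150),
    (71, '2019-11-06 00:51:14+00:00', 216013, 180),
    (92, '2019-12-01 00:51:14+00:00', 114430, 19),
    (92, '2019-12-01 00:51:14+00:00', 114433, 79),
    (92, '2019-12-01 00:51:14+00:00', 114434, 100),
], columns=['id', 'start_time', 'sequence_no', 'value'])

keys = ['start_time', 'id', 'sequence_no']

full = (
    df.groupby(['start_time', 'id'])['sequence_no']
    .apply(lambda s: np.arange(s.min(), s.max() + 1))  # full range per group
    .explode()                                         # one row per sequence_no
    .astype('int64')                                   # explode yields object dtype
    .reset_index()
    .merge(df, on=keys, how='left')                    # missing rows get NaN value
)
# Interpolate only the numeric column that actually has gaps
full['value'] = full['value'].interpolate()
print(full)
```

Interpolating across the whole column is safe here only because each group's first and last sequence_no are real rows, so no gap ever spans two groups.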
Another solution with reindex, without using explode:
result = (df.groupby(["id","start_time"])
.apply(lambda d: d.set_index("sequence_no")
.reindex(range(min(d["sequence_no"]),max(d["sequence_no"])+1)))
.drop(["id","start_time"],axis=1).reset_index()
.interpolate())
print(result)
id start_time sequence_no value
0 71 2018-10-17 20:12:43+00:00 114428 3.0
1 71 2018-10-17 20:12:43+00:00 114429 3.0
2 71 2018-10-17 20:12:43+00:00 114430 41.0
3 71 2018-10-17 20:12:43+00:00 114431 79.0
4 71 2019-11-06 00:51:14+00:00 216009 100.0
5 71 2019-11-06 00:51:14+00:00 216010 125.0
6 71 2019-11-06 00:51:14+00:00 216011 150.0
7 71 2019-11-06 00:51:14+00:00 216012 165.0
8 71 2019-11-06 00:51:14+00:00 216013 180.0
9 92 2019-12-01 00:51:14+00:00 114430 19.0
10 92 2019-12-01 00:51:14+00:00 114431 39.0
11 92 2019-12-01 00:51:14+00:00 114432 59.0
12 92 2019-12-01 00:51:14+00:00 114433 79.0
13 92 2019-12-01 00:51:14+00:00 114434 100.0