
Slow pandas DataFrame MultiIndex reindex

I have a pandas DataFrame of the form:

                       id                start_time  sequence_no    value
0                      71 2018-10-17 20:12:43+00:00       114428        3
1                      71 2018-10-17 20:12:43+00:00       114429        3
2                      71 2018-10-17 20:12:43+00:00       114431       79
3                      71 2019-11-06 00:51:14+00:00       216009      100
4                      71 2019-11-06 00:51:14+00:00       216011      150
5                      71 2019-11-06 00:51:14+00:00       216013      180
6                      92 2019-12-01 00:51:14+00:00       114430       19
7                      92 2019-12-01 00:51:14+00:00       114433       79
8                      92 2019-12-01 00:51:14+00:00       114434      100

What I'm trying to do is fill in the missing sequence_no values for each id / start_time combination. For example, the id / start_time pairing of 71 and 2018-10-17 20:12:43+00:00 is missing sequence_no 114430. For each missing sequence_no that gets added, I also need to average/interpolate the missing value column's value. So the final processing of the above data would end up looking like:

                       id                start_time  sequence_no    value
0                      71 2018-10-17 20:12:43+00:00       114428        3
1                      71 2018-10-17 20:12:43+00:00       114429        3
2                      71 2018-10-17 20:12:43+00:00       114430       41  **
3                      71 2018-10-17 20:12:43+00:00       114431       79
4                      71 2019-11-06 00:51:14+00:00       216009      100  
5                      71 2019-11-06 00:51:14+00:00       216010      125  **
6                      71 2019-11-06 00:51:14+00:00       216011      150
7                      71 2019-11-06 00:51:14+00:00       216012      165  **
8                      71 2019-11-06 00:51:14+00:00       216013      180
9                      92 2019-12-01 00:51:14+00:00       114430       19
10                     92 2019-12-01 00:51:14+00:00       114431       39  **
11                     92 2019-12-01 00:51:14+00:00       114432       59  **
12                     92 2019-12-01 00:51:14+00:00       114433       79
13                     92 2019-12-01 00:51:14+00:00       114434      100

( ** added to the right of newly inserted rows for easier readability)
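To make the starred values concrete: each inserted value is the linear interpolation of its neighbors, so the single gap between 3 and 79 in the first group becomes (3 + 79) / 2 = 41. A minimal sketch of that fill (not from the original code):

import numpy as np
import pandas as pd

# One group after reindexing to the full 114428..114431 range: the lone
# NaN between 3 and 79 is filled with their midpoint, (3 + 79) / 2 = 41.
s = pd.Series([3, 3, np.nan, 79], index=range(114428, 114432))
print(s.interpolate())  # 114430 -> 41.0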

My original solution for doing this relied heavily on Python loops over a large table of data, so it seemed like the ideal place for numpy and pandas to shine. Leaning on SO answers like Pandas: create rows to fill numeric gaps, I came up with:

import pandas as pd
import numpy as np

# Generate dummy data
df = pd.DataFrame([
    (71, '2018-10-17 20:12:43+00:00', 114428, 3),
    (71, '2018-10-17 20:12:43+00:00', 114429, 3),
    (71, '2018-10-17 20:12:43+00:00', 114431, 79),
    (71, '2019-11-06 00:51:14+00:00', 216009, 100),
    (71, '2019-11-06 00:51:14+00:00', 216011, 150),
    (71, '2019-11-06 00:51:14+00:00', 216013, 180),
    (92, '2019-12-01 00:51:14+00:00', 114430, 19),
    (92, '2019-12-01 00:51:14+00:00', 114433, 79),
    (92, '2019-12-01 00:51:14+00:00', 114434, 100),   
], columns=['id', 'start_time', 'sequence_no', 'value'])

# create a new DataFrame with the min/max `sequence_no` values for each `id`/`start_time` pairing
by_start = df.groupby(['start_time', 'id'])
ranges = by_start.agg(
    sequence_min=('sequence_no', np.min), sequence_max=('sequence_no', np.max)
)
reset = ranges.reset_index()

mins = reset['sequence_min']
maxes = reset['sequence_max']

# Use those min/max values to generate a sequence with ALL values in that range
expanded = pd.DataFrame(dict(
    start_time=reset['start_time'].repeat(maxes - mins + 1),
    id=reset['id'].repeat(maxes - mins + 1),
    sequence_no=np.concatenate([np.arange(lo, hi + 1) for lo, hi in zip(mins, maxes)])
))

# Use the above generated DataFrame as an index to generate the missing rows,
# then interpolate the `value` column
expanded_index = pd.MultiIndex.from_frame(expanded)
result = df.set_index(
    ['start_time', 'id', 'sequence_no']
).reindex(expanded_index).interpolate()

The output is correct, but it runs at almost exactly the same speed as my lots-of-Python-loops solution. I'm sure there are places I could cut out a few steps, but the slowest part in my testing appears to be the reindex. Given that the real-world data consists of almost a million rows (operated on frequently), are there any obvious ways to gain a performance advantage over what I've already written? Any way I can speed up this transformation?
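One way to confirm where the time goes is to time the reindex step in isolation on synthetic data. The sketch below is a stand-in, not the original benchmark: the group count, group size, and gap density are assumptions chosen to roughly mimic a large workload.

import time

import numpy as np
import pandas as pd

# Synthetic stand-in: 1,000 (start_time, id) groups, each holding 100 of
# 200 possible sequence numbers, so roughly half the rows are gaps.
rng = np.random.default_rng(0)
parts = []
for gid in range(1000):
    seqs = np.sort(rng.choice(np.arange(100_000, 100_200), size=100, replace=False))
    parts.append(pd.DataFrame({
        'id': gid,
        'start_time': pd.Timestamp('2019-01-01', tz='UTC'),
        'sequence_no': seqs,
        'value': rng.integers(0, 200, size=100),
    }))
big = pd.concat(parts, ignore_index=True)

# Build the expanded frame as in the question, then time only the reindex.
ranges = big.groupby(['start_time', 'id'])['sequence_no'].agg(['min', 'max']).reset_index()
counts = ranges['max'] - ranges['min'] + 1
expanded = pd.DataFrame({
    'start_time': ranges['start_time'].repeat(counts).to_numpy(),
    'id': ranges['id'].repeat(counts).to_numpy(),
    'sequence_no': np.concatenate(
        [np.arange(lo, hi + 1) for lo, hi in zip(ranges['min'], ranges['max'])]),
})

t0 = time.perf_counter()
result = (big.set_index(['start_time', 'id', 'sequence_no'])
             .reindex(pd.MultiIndex.from_frame(expanded))
             .interpolate())
print(f'reindex + interpolate: {time.perf_counter() - t0:.3f}s')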

Update 9/12/2019

Combining the merge solution from this answer with the original construction of the expanded DataFrame yields the fastest results so far, when tested on a sufficiently large dataset:

import pandas as pd
import numpy as np

# Generate dummy data
df = pd.DataFrame([
    (71, '2018-10-17 20:12:43+00:00', 114428, 3),
    (71, '2018-10-17 20:12:43+00:00', 114429, 3),
    (71, '2018-10-17 20:12:43+00:00', 114431, 79),
    (71, '2019-11-06 00:51:14+00:00', 216009, 100),
    (71, '2019-11-06 00:51:14+00:00', 216011, 150),
    (71, '2019-11-06 00:51:14+00:00', 216013, 180),
    (92, '2019-12-01 00:51:14+00:00', 114430, 19),
    (92, '2019-12-01 00:51:14+00:00', 114433, 79),
    (92, '2019-12-01 00:51:14+00:00', 114434, 100),   
], columns=['id', 'start_time', 'sequence_no', 'value'])

# create a ranges df with groupby and agg
ranges = df.groupby(['start_time', 'id'])['sequence_no'].agg([
    ('sequence_min', np.min), ('sequence_max', np.max)
])
reset = ranges.reset_index()

mins = reset['sequence_min']
maxes = reset['sequence_max']

# Use those min/max values to generate a sequence with ALL values in that range
expanded = pd.DataFrame(dict(
    start_time=reset['start_time'].repeat(maxes - mins + 1),
    id=reset['id'].repeat(maxes - mins + 1),
    sequence_no=np.concatenate([np.arange(mins, maxes + 1) for mins, maxes in zip(mins, maxes)])
))

# merge expanded and df
merge = expanded.merge(df, on=['start_time', 'id', 'sequence_no'], how='left')
# interpolate and assign values 
merge['value'] = merge['value'].interpolate()
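As a sanity check (not in the original post), the merge-based result can be compared against the reindex-based one, reusing the expanded frame built above; selecting the columns explicitly keeps order and dtypes aligned:

# Recompute the reindex-based result and assert both paths agree.
reindexed = (df.set_index(['start_time', 'id', 'sequence_no'])
               .reindex(pd.MultiIndex.from_frame(expanded))
               .interpolate()
               .reset_index())
cols = ['start_time', 'id', 'sequence_no', 'value']
pd.testing.assert_frame_equal(merge[cols], reindexed[cols])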

Using merge instead of reindex may speed things up. Also, using map instead of a list comprehension may help as well.

import pandas as pd
import numpy as np

# Generate dummy data
df = pd.DataFrame([
    (71, '2018-10-17 20:12:43+00:00', 114428, 3),
    (71, '2018-10-17 20:12:43+00:00', 114429, 3),
    (71, '2018-10-17 20:12:43+00:00', 114431, 79),
    (71, '2019-11-06 00:51:14+00:00', 216009, 100),
    (71, '2019-11-06 00:51:14+00:00', 216011, 150),
    (71, '2019-11-06 00:51:14+00:00', 216013, 180),
    (92, '2019-12-01 00:51:14+00:00', 114430, 19),
    (92, '2019-12-01 00:51:14+00:00', 114433, 79),
    (92, '2019-12-01 00:51:14+00:00', 114434, 100),   
], columns=['id', 'start_time', 'sequence_no', 'value'])

# create a ranges df with groupby and agg
ranges = df.groupby(['start_time', 'id'])['sequence_no'].agg([('sequence_min', np.min), ('sequence_max', np.max)])
# map with range to create the sequence number range for each group
ranges['sequence_no'] = list(map(lambda x, y: range(x, y), ranges.pop('sequence_min'), ranges.pop('sequence_max') + 1))
# explode your DataFrame (explode requires pandas >= 0.25)
new_df = ranges.explode('sequence_no')
# merge new_df and df
merge = new_df.reset_index().merge(df, on=['start_time', 'id', 'sequence_no'], how='left')
# interpolate and assign values 
merge['value'] = merge['value'].interpolate()

                   start_time  id sequence_no  value
0   2018-10-17 20:12:43+00:00  71      114428    3.0
1   2018-10-17 20:12:43+00:00  71      114429    3.0
2   2018-10-17 20:12:43+00:00  71      114430   41.0
3   2018-10-17 20:12:43+00:00  71      114431   79.0
4   2019-11-06 00:51:14+00:00  71      216009  100.0
5   2019-11-06 00:51:14+00:00  71      216010  125.0
6   2019-11-06 00:51:14+00:00  71      216011  150.0
7   2019-11-06 00:51:14+00:00  71      216012  165.0
8   2019-11-06 00:51:14+00:00  71      216013  180.0
9   2019-12-01 00:51:14+00:00  92      114430   19.0
10  2019-12-01 00:51:14+00:00  92      114431   39.0
11  2019-12-01 00:51:14+00:00  92      114432   59.0
12  2019-12-01 00:51:14+00:00  92      114433   79.0
13  2019-12-01 00:51:14+00:00  92      114434  100.0

A shorter version of the merge solution:

(df.groupby(['start_time', 'id'])['sequence_no']
   .apply(lambda x: np.arange(x.min(), x.max() + 1))
   # explode() returns object dtype; cast so the merge keys match df's integers
   .explode().astype('int64')
   .reset_index()
   .merge(df, on=['start_time', 'id', 'sequence_no'], how='left')
   .interpolate())

Output:

                   start_time  id sequence_no  value
0   2018-10-17 20:12:43+00:00  71      114428    3.0
1   2018-10-17 20:12:43+00:00  71      114429    3.0
2   2018-10-17 20:12:43+00:00  71      114430   41.0
3   2018-10-17 20:12:43+00:00  71      114431   79.0
4   2019-11-06 00:51:14+00:00  71      216009  100.0
5   2019-11-06 00:51:14+00:00  71      216010  125.0
6   2019-11-06 00:51:14+00:00  71      216011  150.0
7   2019-11-06 00:51:14+00:00  71      216012  165.0
8   2019-11-06 00:51:14+00:00  71      216013  180.0
9   2019-12-01 00:51:14+00:00  92      114430   19.0
10  2019-12-01 00:51:14+00:00  92      114431   39.0
11  2019-12-01 00:51:14+00:00  92      114432   59.0
12  2019-12-01 00:51:14+00:00  92      114433   79.0
13  2019-12-01 00:51:14+00:00  92      114434  100.0

Another solution uses reindex, without explode:

result = (df.groupby(["id","start_time"])
          .apply(lambda d: d.set_index("sequence_no")
          .reindex(range(min(d["sequence_no"]),max(d["sequence_no"])+1)))
          .drop(["id","start_time"],axis=1).reset_index()
          .interpolate())

print(result)

    id                 start_time  sequence_no  value
0   71  2018-10-17 20:12:43+00:00       114428    3.0
1   71  2018-10-17 20:12:43+00:00       114429    3.0
2   71  2018-10-17 20:12:43+00:00       114430   41.0
3   71  2018-10-17 20:12:43+00:00       114431   79.0
4   71  2019-11-06 00:51:14+00:00       216009  100.0
5   71  2019-11-06 00:51:14+00:00       216010  125.0
6   71  2019-11-06 00:51:14+00:00       216011  150.0
7   71  2019-11-06 00:51:14+00:00       216012  165.0
8   71  2019-11-06 00:51:14+00:00       216013  180.0
9   92  2019-12-01 00:51:14+00:00       114430   19.0
10  92  2019-12-01 00:51:14+00:00       114431   39.0
11  92  2019-12-01 00:51:14+00:00       114432   59.0
12  92  2019-12-01 00:51:14+00:00       114433   79.0
13  92  2019-12-01 00:51:14+00:00       114434  100.0
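Since this transformation runs frequently on the real data, it may be worth packaging the fastest variant (expanded frame + merge) as a reusable function. A sketch along the lines of the update above; the function name and the .to_numpy() calls (which sidestep duplicate-index alignment when building the expanded frame) are my additions:

import numpy as np
import pandas as pd

def fill_sequence_gaps(df):
    """Insert missing sequence_no rows per (start_time, id) group and
    linearly interpolate the value column (merge-based approach)."""
    ranges = (df.groupby(['start_time', 'id'])['sequence_no']
                .agg(['min', 'max'])
                .reset_index())
    counts = ranges['max'] - ranges['min'] + 1
    expanded = pd.DataFrame({
        'start_time': ranges['start_time'].repeat(counts).to_numpy(),
        'id': ranges['id'].repeat(counts).to_numpy(),
        'sequence_no': np.concatenate(
            [np.arange(lo, hi + 1) for lo, hi in zip(ranges['min'], ranges['max'])]),
    })
    out = expanded.merge(df, on=['start_time', 'id', 'sequence_no'], how='left')
    out['value'] = out['value'].interpolate()
    return out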
