简体   繁体   中英

Inserting "missing" multiindex rows into a Pandas Dataframe

I have a pandas DataFrame with a two-level multiindex. The second level is numeric and supposed to be sorted and sequential for each unique value of the first-level index, but has gaps. How do I insert the "missing" rows? Sample input:

import pandas as pd
df = pd.DataFrame(list(range(5)),
                  index=pd.MultiIndex.from_tuples([('A',1), ('A',3),
                                                   ('B',2), ('B',3), ('B',6)]),
                  columns='value')
#     value
#A 1      0
#  3      1
#B 2      2
#  3      3
#  6      4

Expected output:

#     value
#A 1      0
#  2    NaN
#  3      1
#B 2      2
#  3      3
#  4    NaN
#  5    NaN
#  6      4

I suspect I could have used resample , but I am having trouble converting the numbers to anything date-like.

If there is a will, there is a way. I am not proud of this but, I think it works.

Try:

def f(x):
    levels = x.index.remove_unused_levels().levels
    x = x.reindex(pd.MultiIndex.from_product([levels[0], np.arange(levels[1][0], levels[1][-1]+1)]))
    return x

df.groupby(level=0, as_index=False, group_keys=False).apply(f)

Output:

     value
A 1    0.0
  2    NaN
  3    1.0
B 2    2.0
  3    3.0
  4    NaN
  5    NaN
  6    4.0

After much deliberations, I was able to come up with a solution myself. Judging by the fact of how lousy it is, the problem I am facing is not a very typical one.

new_index = d.index.to_frame()\
                .groupby(0)[1]\
                .apply(lambda x:
                         pd.Series(1, index=range(x.min(), x.max() + 1))).index
d.reindex(new_index)

You can simply use the following depends on the missing index:

result.unstack(1).stack(0, dropna=False).fillna(0)

When you unstack, the pandas expand the df to have rows and columns and in the above example, level 1 index is going to be the column names. Then, again by stacking, you return the df to its original form, BUT, this time you need to make sure you use dropna=False so the NaN values are going to be there for missing indexes. In the end, using .fillna(0) is optional depends on what you want to do with the NaN values.

There's no accounting for taste, but I think falling back to list comprehension leads to slightly more readable code:

df.reindex(
    pd.MultiIndex.from_tuples([
        (level_0, level_1)
        for level_0 in df.reset_index(0).level_0.unique()
        for level_1 in range(
            df.reset_index(1).loc[level_0, "level_1"].min(),
            df.reset_index(1).loc[level_0, "level_1"].max()+1
        )
]))

# Output:
#value
#A  1   0.0
#   2   NaN
#   3   1.0
#B  2   2.0
#   3   3.0
#   4   NaN
#   5   NaN
#   6   4.0

Although this is of course slower than going down the apply route:

list-comprehension: 2.57 ms ± 19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
DYZ apply: 1.25 ms ± 8.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Scott's apply: 2.19 ms ± 9.84 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM