
pandas partial join on multiindex

So, this is my problem:

import numpy as np
import pandas as pd

# dfa: three index levels (a, b, c) with one value column d
dfa = pd.DataFrame({"a": [["a", "b", "c"][k // 10] for k in range(30)],
                    "b": ["a" + repr([10, 20, 30, 40, 50, 60][k // 5]) for k in range(30)],
                    "c": np.arange(30),
                    "d": np.random.normal(size=30)}).set_index(["a", "b", "c"])
# dfb: two index levels (a, b) with one value column m
dfb = pd.DataFrame({"a": [["a", "b", "c"][k // 2] for k in range(6)],
                    "b": ["a" + repr([10, 20, 30, 40, 50, 60][k]) for k in range(6)],
                    "m": np.random.normal(size=6)**2}).set_index(["a", "b"])

Essentially, I have two dataframes with multi-indices, and I want to divide dfa.d by dfb.m, joining on ("a", "b"). I can't naively do dfa.d / dfb.m or use join, because pandas complains that "merging with more than one level overlap on a multi-index is not implemented".

The most straightforward (lol) way of doing this that I found is:

dfc = dfa.reset_index().set_index(["a", "b"]).join(dfb)
dfc["r"] = dfc.d / dfc.m
dfd = dfc.reset_index().set_index(["a", "b", "c"])[["r"]]

Any shortcuts?

There's an open bug for this problem, and the current milestone says 0.15.1.

Until something nicer comes along, there's a workaround involving the following steps:

  • get the non-matching index level(s) out of the way by unstacking them into columns
  • perform the multiplication/division
  • stack the columns back to where they were.

Like this:

In [109]: dfa.unstack('c').mul(dfb.squeeze(), axis=0).stack('c')
Out[109]: 
                  d
a b   c            
a a10 0    1.535221
      1   -2.151894
      2    1.986061
      3   -1.946031
      4   -4.868800
  a20 5   -2.278917
      6   -1.535684
      7    2.289102
      8   -0.442284
      9   -0.547209
b a30 10 -12.568426
      11   7.180348
      12   1.584510
      13   3.419332
      14  -3.011810
  a40 15  -0.367091
      16   4.264955
      17   2.410733
      18   0.030926
      19   1.219653
c a50 20   0.110586
      21  -0.430263
      22   0.350308
      23   1.101523
      24  -1.371180
  a60 25  -0.003683
      26   0.069884
      27   0.206635
      28   0.356708
      29   0.111380
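
The question asks for a division rather than a multiplication, so here is a minimal sketch of the same unstack/stack workaround using .div instead of .mul. The seeded random generator and the trailing .dropna() are my additions, not part of the original answer: depending on the pandas version, stack may keep the NaN entries that unstack introduced, and .dropna() should remove them in that case. The result is compared with the longer reset_index/join approach from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

dfa = pd.DataFrame({
    "a": [["a", "b", "c"][k // 10] for k in range(30)],
    "b": ["a" + repr([10, 20, 30, 40, 50, 60][k // 5]) for k in range(30)],
    "c": np.arange(30),
    "d": rng.normal(size=30),
}).set_index(["a", "b", "c"])
dfb = pd.DataFrame({
    "a": [["a", "b", "c"][k // 2] for k in range(6)],
    "b": ["a" + repr([10, 20, 30, 40, 50, 60][k]) for k in range(6)],
    "m": rng.normal(size=6) ** 2,
}).set_index(["a", "b"])

# Workaround: move "c" into the columns, divide row-wise by the Series
# dfb["m"] (aligned on the remaining levels a and b), then move "c" back.
ratio = dfa.unstack("c").div(dfb["m"], axis=0).stack("c").dropna()

# Reference result using the reset_index/join approach from the question.
dfc = dfa.reset_index().set_index(["a", "b"]).join(dfb)
dfc["r"] = dfc.d / dfc.m
expected = dfc.reset_index().set_index(["a", "b", "c"])["r"]

# Compare the two results after aligning them on the same index.
same = np.allclose(ratio["d"].reindex(expected.index).to_numpy(),
                   expected.to_numpy())
print(same)
```

Note that the result column is still called d here; rename it if you want to match the r column from the question.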

Notice two things:

  1. dfb has to be a Series, otherwise there's an additional complication about which columns of dfb to use for the multiplication. You could replace dfb.squeeze() with dfb['m'].
  2. If the non-matching index level was not already the last of the three, the order of the index levels would not be preserved. In that case, do what @jreback suggests and reorder the index levels afterwards: .reorder_levels(dfa.index.names)
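
To illustrate the second point, here is a small sketch with made-up data in which the non-matching level c sits in the middle of the index. The names left and right and all the values are hypothetical, chosen only for illustration; the .dropna() and .sort_index() calls are my additions to keep the result tidy across pandas versions.

```python
import pandas as pd

# Hypothetical frame whose non-matching level "c" is NOT the last level.
left = pd.DataFrame(
    {"d": [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_tuples(
        [("x", 0, "p"), ("x", 1, "p"), ("y", 2, "q"), ("y", 3, "q")],
        names=["a", "c", "b"],
    ),
)
# Hypothetical Series indexed only by the matching levels (a, b).
right = pd.Series(
    [10.0, 100.0],
    index=pd.MultiIndex.from_tuples([("x", "p"), ("y", "q")], names=["a", "b"]),
    name="m",
)

# After unstack/stack, "c" ends up as the last index level: (a, b, c).
out = left.unstack("c").mul(right, axis=0).stack("c").dropna()

# Restore the original level order (a, c, b), then sort for readability.
out = out.reorder_levels(left.index.names).sort_index()
print(out)
```

Without the reorder_levels call, the index of out would be (a, b, c) rather than the original (a, c, b).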
