简体   繁体   中英

Is there a way to concatenate the output of a pandas apply into a single multiindex?

I have a dataframe with two columns where each of the two columns contains a list indices. I want to get the product of the two lists on a row by row level to create a multiindex.

For example, df1 below

|---------------------|------------------|
|        col_a        |        col_b     |
|---------------------|------------------|
|       [A1, A2]      |        [B1]      |
|---------------------|------------------|
|          [A3]       |      [B2, B3]    |
|---------------------|------------------|

would turn into this:

MultiIndex([('A1', 'B1'),
            ('A2', 'B1'),
            ('A3', 'B2'),
            ('A3', 'B3')],
names = ['col_a', 'col_b'], length = 4)

What I'm doing right now is the following:

my_multiindex = df1.apply(lambda row: pd.MultiIndex.from_product([row["col_a"], row["col_b"]], names = ['col_a', 'col_b']), axis = 1)

However, the output of this is looks like:

0     MultiIndex([('A1', 'B1'), ('A2', 'B1')])
1     MultiIndex([('A3', 'B2'), ('A3', 'B3')])

One object per row, which makes sense since we are using an apply on axis 1.

Is there any way that I can "concatenate" all the output objects into a single multiindex? Preferably this can be done within the lambda function, but doing it afterwards in a separate step wouldn't be the end of the world.

If you have any suggestions for how to do the overall task (not just the specific "concatenate" step I described) that would be helpful as well. I've already tried doing this in a for loop using iterrows but that is taking way too long as the size of df1 is about 50K rows, the average length of the list in col_a is 300, and the average length of the list in col_b is 30.

Any help would be appreciated!

The easiest way is probably also the "dumbest": just take products of each row, put them all together, and feed them to pd.MultiIndex.from_tuples .

import itertools
rowwise_products = df.apply(
    lambda row: list(itertools.product(*row)),
    axis=1
)
all_tuples = rowwise_products.sum()  # list concatenation

>>> pd.MultiIndex.from_tuples(all_tuples, names=df.columns)
MultiIndex([('A1', 'B1'),
            ('A2', 'B1'),
            ('A3', 'B1'),
            ('A3', 'B2'),
            ('A3', 'B3')],
           names=['col_a', 'col_b'])

Perhaps start with simple itertools version of product . Use product(*row) to generate the cartesian product (also works if you had more than 2 columns). Then sum() and Multiindex.from_tuples :

pd.MultiIndex.from_tuples(df.apply(lambda r: list(product(*r)), axis=1).sum())

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM