
Python pandas explode (one to many relationship)

Suppose I have the following dataframe with columns name, preference, fruits:

name   preference   fruits
adam    likes       apples
mike   dislikes     orange

Now suppose the dataframe has a one-to-many relationship: each value in column name should be paired with every combination of values in columns preference and fruits. For example, the output dataframe I am looking for is:

name   preference   fruits
adam    likes       apples
adam    likes       orange
adam    dislikes    apples
adam    dislikes    orange
mike    likes       apples
mike    likes       orange
mike    dislikes    apples
mike    dislikes    orange

Is this possible? From what I know about pandas so far, I believe I will have to use groupby. Any help is appreciated, thanks!

It looks like you just want the cross product of the columns:

(pd.MultiIndex.from_product([df[col] for col in df],
                           names=df.columns)
   .to_frame().reset_index(drop=True)
)

Output:

   name preference  fruits
0  adam      likes  apples
1  adam      likes  orange
2  adam   dislikes  apples
3  adam   dislikes  orange
4  mike      likes  apples
5  mike      likes  orange
6  mike   dislikes  apples
7  mike   dislikes  orange
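For what it's worth, the same cross product can probably also be written as a chain of cross joins. This is only a sketch of an alternative, not what the answer above does: it assumes pandas 1.2 or newer (where merge accepts how='cross') and rebuilds the question's frame so it runs on its own. The name crossed is just a placeholder.

```python
import pandas as pd

# The frame from the question, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    'name': ['adam', 'mike'],
    'preference': ['likes', 'dislikes'],
    'fruits': ['apples', 'orange'],
})

# Start from the first column, then cross-join each remaining column in turn.
# how='cross' requires pandas 1.2 or newer.
crossed = df[['name']]
for col in df.columns[1:]:
    crossed = crossed.merge(df[[col]], how='cross')

print(crossed)
```

Each cross join pairs every row on the left with every row on the right, so three two-value columns give 2 * 2 * 2 = 8 rows, in the same order as the output above.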

I'd use itertools.product

import pandas as pd
from itertools import product


df = pd.DataFrame({
    'name': ['adam', 'mike'],
    'preference': ['likes', 'dislikes'],
    'fruits': ['apples', 'oranges']
})

ndf = pd.DataFrame(
    product(*[df[c] for c in df.columns]),
    columns=df.columns
)

print(ndf)
#    name preference   fruits
# 0  adam      likes   apples
# 1  adam      likes  oranges
# 2  adam   dislikes   apples
# 3  adam   dislikes  oranges
# 4  mike      likes   apples
# 5  mike      likes  oranges
# 6  mike   dislikes   apples
# 7  mike   dislikes  oranges
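One caveat worth keeping in mind: product works on every value in each column, so if a column contains repeats, the result contains repeated rows. A minimal sketch of one way around that, assuming you only want distinct combinations, is to take the unique values per column first. The frame dup and the names raw and deduped are made up for illustration.

```python
import pandas as pd
from itertools import product

# Hypothetical frame in which the first row appears twice.
dup = pd.DataFrame({
    'name': ['adam', 'mike', 'adam'],
    'preference': ['likes', 'dislikes', 'likes'],
    'fruits': ['apples', 'oranges', 'apples'],
})

# Product over the raw columns: 3 * 3 * 3 = 27 rows, many of them repeated.
raw = pd.DataFrame(
    product(*[dup[c] for c in dup.columns]),
    columns=dup.columns
)

# Product over the unique values of each column: 2 * 2 * 2 = 8 distinct rows.
deduped = pd.DataFrame(
    product(*[dup[c].unique() for c in dup.columns]),
    columns=dup.columns
)

print(len(raw), len(deduped))  # 27 8
```

Series.unique() keeps values in order of first appearance, so the row order stays the same as in the two-row example.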

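As for speed, on this small two-row frame the itertools.product version also comes out roughly five times faster (about 624 µs versus 3.51 ms per loop):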

%%timeit
ndf = pd.DataFrame(
    product(*[df[c] for c in df.columns]),
    columns=df.columns
)
# 624 µs ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


%%timeit
(pd.MultiIndex.from_product([df[col] for col in df],
                           names=df.columns)
   .to_frame().reset_index(drop=True)
)
# 3.51 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
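The %%timeit cells above only run inside IPython or Jupyter. If you want to repeat the comparison in a plain script, something like the following should work with the standard timeit module. The function names and the repetition count are my own choices, and the timings will differ by machine, pandas version and frame size, so no numbers are shown here.

```python
import timeit
from itertools import product

import pandas as pd

df = pd.DataFrame({
    'name': ['adam', 'mike'],
    'preference': ['likes', 'dislikes'],
    'fruits': ['apples', 'oranges']
})


def with_itertools(frame):
    # Cross product of the columns via itertools.product.
    return pd.DataFrame(
        product(*[frame[c] for c in frame.columns]),
        columns=frame.columns
    )


def with_multiindex(frame):
    # Cross product of the columns via MultiIndex.from_product.
    return (pd.MultiIndex.from_product([frame[c] for c in frame.columns],
                                       names=list(frame.columns))
              .to_frame().reset_index(drop=True))


# Check that both approaches produce the same rows before timing them.
same_rows = (with_itertools(df).values.tolist()
             == with_multiindex(df).values.tolist())

runs = 100  # arbitrary repetition count
t_itertools = timeit.timeit(lambda: with_itertools(df), number=runs) / runs
t_multiindex = timeit.timeit(lambda: with_multiindex(df), number=runs) / runs

print(f"same rows: {same_rows}")
print(f"itertools.product: {t_itertools * 1e6:.0f} µs per call")
print(f"MultiIndex.from_product: {t_multiindex * 1e6:.0f} µs per call")
```

It is worth rerunning this with frames closer to your real data, since the gap measured on a two-row frame may not hold at larger sizes.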
