
Finding unique values in pandas column where each row has multiple values

I have the following column in a DataFrame, which contains colors separated by |:

import pandas as pd

df = pd.DataFrame({'x': ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK']})

I want to find all unique colors from the column.

Expected Output :

{'YELLOW', 'BLACK', 'RED', 'BLUE', 'BROWN', 'GREEN', 'WHITE', 'PINK'}

I don't mind whether it is a list or a set.

What I tried :

df['x'] = df['x'].apply(lambda x: x.split("|"))

colors = []
for idx, row in df.iterrows():
    colors.extend(row['x'])

print(set(colors))

This works, but I am looking for a more efficient solution because I have a large dataset.

set(df.loc[:, 'x'].str.split('|', expand=True).values.ravel())

or, to drop the None that expand=True uses to pad the shorter rows:

set(df.loc[:, 'x'].str.split('|', expand=True).values.ravel()) - set([None])
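The subtraction of set([None]) is needed because expand=True builds a rectangular frame: every row gets as many columns as the longest row, and shorter rows are filled with a missing-value marker. A minimal sketch of that behaviour on the question's data follows. It filters with pd.notna instead of subtracting set([None]); that is my own substitution, on the assumption that the marker could be None or NaN depending on the pandas version and column dtype.

```python
import pandas as pd

df = pd.DataFrame({'x': ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK']})

# expand=True spreads the pieces over columns; the longest row has 4 colors,
# so the two 3-color rows each get one padded cell.
wide = df['x'].str.split('|', expand=True)
print(wide.shape)

# Flatten to one dimension and keep only real values, skipping the padding.
flat = wide.values.ravel()
colors = {c for c in flat if pd.notna(c)}
print(sorted(colors))
```

Note that the padded frame has rows x (longest row) cells, so one unusually long row makes every other row wider as well.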
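Another option is to stack the split columns and drop duplicates, which keeps the colors in order of first appearance:

list(df.x.str.split('|', expand=True).stack().reset_index(name='x').drop_duplicates('x')['x'])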

Output:

['RED', 'BROWN', 'YELLOW', 'WHITE', 'BLACK', 'GREEN', 'BLUE', 'PINK']

Use itertools (arguably the fastest way to flatten lists) with set:

import itertools
set(itertools.chain.from_iterable(df.x.str.split('|')))

Output:

{'BLACK', 'BLUE', 'BROWN', 'GREEN', 'PINK', 'RED', 'WHITE', 'YELLOW'}
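For reference, here is a self-contained version of the itertools approach. The extra row containing None and the dropna() call are my own additions, not part of the answer: str.split leaves a missing value as a missing value rather than a list, and chain.from_iterable would raise a TypeError on it, so this sketch assumes you want missing rows skipped.

```python
import itertools

import pandas as pd

# Question's data plus one missing row (added here for illustration only).
df = pd.DataFrame({'x': ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK', None]})

# Drop missing rows first, then split each remaining string into a list of colors.
parts = df['x'].dropna().str.split('|')

# chain.from_iterable walks the lists lazily, so no intermediate flat list is built.
unique_colors = set(itertools.chain.from_iterable(parts))
print(sorted(unique_colors))
```

If the column is known to have no missing values, the dropna() call can be left out.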

Another possible solution with functools which is almost as fast as itertools:

import functools
import operator
set(functools.reduce(operator.iadd, df.x.str.split('|'), []))

Note that you can also use sum(), which is more readable but not quite as fast.
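A rough way to compare the three flattening approaches on your own data is sketched below. The function names, the 500x replication factor, and the repeat count are arbitrary choices of mine, and timings will vary by machine and pandas version, so no numbers are shown. One reason to expect sum() to fall behind on long columns is that each + builds a brand-new list, so the total copying grows quadratically with the number of rows.

```python
import functools
import itertools
import operator
import timeit

import pandas as pd

df = pd.DataFrame({'x': ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK']})
big = pd.concat([df] * 500, ignore_index=True)  # 1500 rows, illustrative size only
parts = big['x'].str.split('|')  # split once so only the flattening step is timed


def with_chain(lists):
    return set(itertools.chain.from_iterable(lists))


def with_reduce(lists):
    # iadd extends the accumulator list in place instead of copying it.
    return set(functools.reduce(operator.iadd, lists, []))


def with_sum(lists):
    # Built-in sum with a list start value; every step creates a new list.
    return set(sum(lists, []))


results = {f.__name__: f(parts) for f in (with_chain, with_reduce, with_sum)}

for f in (with_chain, with_reduce, with_sum):
    seconds = timeit.timeit(lambda: f(parts), number=5)
    print(f"{f.__name__}: {seconds:.4f}s for 5 runs")
```

All three functions return the same set; only the cost of building it differs.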

You can also do set(df['x'].str.split('|').values.sum())

This also keeps None out of the output, because the rows are never padded to equal length:

{'YELLOW', 'RED', 'WHITE', 'BROWN', 'GREEN', 'PINK', 'BLUE', 'BLACK'}

