I have the following column in a dataframe, which contains colors separated by |:
import pandas as pd
df = pd.DataFrame({'x': ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK']})
I want to find all unique colors from the column.
Expected output:
{'YELLOW', 'BLACK', 'RED', 'BLUE', 'BROWN', 'GREEN', 'WHITE', 'PINK'}
I don't mind whether it is a list or a set.
What I tried:
df['x'] = df['x'].apply(lambda x: x.split("|"))
colors = []
for idx, row in df.iterrows():
    colors.extend(row['x'])
print(set(colors))
This works fine, but I am looking for a more efficient solution because I have a large dataset.
set(df.loc[:, 'x'].str.split('|', expand=True).values.ravel())
or, to drop the None values:
set(df.loc[:, 'x'].str.split('|', expand=True).values.ravel()) - set([None])
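The None subtraction is needed because expand=True returns one column per token and pads the shorter rows so that every row has the same width. Here is a minimal sketch of that behaviour on the question's sample data. The names wide and colors are only for illustration, and the sketch filters with pd.notna instead of the set subtraction above, on the assumption that the padding value may differ between pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'x': ['RED|BROWN|YELLOW',
                         'WHITE|BLACK|YELLOW|GREEN',
                         'BLUE|RED|PINK']})

# expand=True gives a DataFrame with one column per token;
# rows with fewer tokens are padded with a missing value.
wide = df['x'].str.split('|', expand=True)
print(wide.shape)  # 3 rows, as many columns as the longest row has tokens

# Flatten to 1-D and keep only real tokens (pd.notna skips the padding).
colors = {c for c in wide.values.ravel() if pd.notna(c)}
print(sorted(colors))
```

Building the padded table costs extra memory when row lengths vary a lot, which may matter on a large dataset.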
list(df.x.str.split('|', expand=True).stack().reset_index(name='x').drop_duplicates('x')['x'])
Output:
['RED', 'BROWN', 'YELLOW', 'WHITE', 'BLACK', 'GREEN', 'BLUE', 'PINK']
Use itertools (arguably the fastest way to flatten lists) with set:
import itertools
set(itertools.chain.from_iterable(df.x.str.split('|')))
Output:
{'BLACK', 'BLUE', 'BROWN', 'GREEN', 'PINK', 'RED', 'WHITE', 'YELLOW'}
Another possible solution uses functools, which is almost as fast as itertools:
import functools
import operator
set(functools.reduce(operator.iadd, df.x.str.split('|'), []))
Note that you can also use sum(), which is readable but not quite as fast.
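For reference, the sum() variant could look like the sketch below, using the question's sample data. The name flat is only illustrative. Each addition builds a new list, which is probably why this form falls behind on larger inputs:

```python
import pandas as pd

df = pd.DataFrame({'x': ['RED|BROWN|YELLOW',
                         'WHITE|BLACK|YELLOW|GREEN',
                         'BLUE|RED|PINK']})

# Built-in sum() with an empty list as the start value
# concatenates the per-row lists one after another.
flat = sum(df['x'].str.split('|'), [])

colors = set(flat)
print(sorted(colors))
```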
You can also do set(df['x'].str.split('|').values.sum())
This also leaves no None in the output, because no expand=True padding is involved:
{'YELLOW', 'RED', 'WHITE', 'BROWN', 'GREEN', 'PINK', 'BLUE', 'BLACK'}
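Since the question is about a large dataset, it may be worth timing the candidates on your own data rather than relying on the relative speed claims in this thread. A rough sketch, assuming synthetic data made by repeating the sample rows (the repeat count, the run count and the helper function names are all hypothetical):

```python
import itertools
import timeit

import pandas as pd

# Hypothetical benchmark data: the question's three sample rows, repeated.
rows = ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK']
df = pd.DataFrame({'x': rows * 1000})


def with_chain():
    # Flatten the per-row lists lazily, then deduplicate.
    return set(itertools.chain.from_iterable(df['x'].str.split('|')))


def with_expand():
    # Split into a padded table, then drop the padding with pd.notna.
    wide = df['x'].str.split('|', expand=True)
    return {c for c in wide.values.ravel() if pd.notna(c)}


def with_sum():
    # Concatenate the per-row lists with the built-in sum().
    return set(sum(df['x'].str.split('|'), []))


candidates = {'chain': with_chain, 'expand': with_expand, 'sum': with_sum}

# Run each candidate once to confirm they agree before timing them.
results = {name: fn() for name, fn in candidates.items()}

for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=3)
    print(f'{name}: {seconds:.4f} s for 3 runs')
```

Timings depend on the pandas version, the number of rows and how many tokens each row has, so treat the numbers as a guide for your own data only.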