I have a dataframe as below:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': ["west, east", "east, north", "south, west", "east, south"]})
   a            b
0  1   west, east
1  2  east, north
2  3  south, west
3  4  east, south
I want to get the unique strings from column b, like below.
Expected Output:
["east", "west", "north", "south"] # order doesn't matter here
My Efforts
op = []
for _, value in df['b'].items():
    op.extend(value.split(', '))
op = set(op)
This gives me the correct result, but is there a more efficient way to do it?
My original dataset has around a million rows and thousands of unique values.
You can create one long string with join, then split it, convert to a set, and finally to a list:
a = list(set(', '.join(df['b']).split(', ')))
print(a)
['south', 'north', 'west', 'east']
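Not part of the original answer, but if column b could contain missing values (an assumption here), the join would raise a TypeError on NaN; one hedged workaround is to drop them first:

# Assumption: column b may contain NaN, which str.join cannot handle,
# so drop missing entries before joining
a = list(set(', '.join(df['b'].dropna()).split(', ')))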
Or use a set comprehension with split to flatten the values:
a = list({y for x in df['b'] for y in x.split(', ')})
print(a)
['south', 'north', 'west', 'east']
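Another flattening variant, not from the original answer, uses itertools.chain.from_iterable instead of a nested comprehension:

from itertools import chain

# Flatten the split lists lazily, then deduplicate with set
a = list(set(chain.from_iterable(x.split(', ') for x in df['b'])))
print(a)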
A pure pandas solution uses Series.str.split, DataFrame.stack, and Series.unique, then converts to a list:
a = df.b.str.split(', ', expand=True).stack().unique().tolist()
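As a side note (not part of the original answer), on pandas 0.25+ the same idea can be written with Series.explode instead of expand and stack:

# str.split without expand gives a Series of lists; explode unpacks them into rows
a = df['b'].str.split(', ').explode().unique().tolist()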
You'd have to profile your code to determine whether this is quicker for your particular use case (a timing sketch follows the snippet below), but pandas' built-in vectorised methods may show some benefit on larger data sets.
Try using a combination of Series.str.split() and Series.unique().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html
# Split the comma-separated strings in column b into separate columns
df_split = df['b'].str.split(', ', expand=True)

# For each resulting column, add its unique values to a set
uniques = set()
for col in df_split:
    uniques.update(df_split[col].unique())
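As suggested above, the only reliable way to choose an approach is to time the candidates on data shaped like yours. A minimal sketch, assuming a synthetic Series of one million two-word rows (the word list and sizes are made up purely for illustration):

import time
import numpy as np
import pandas as pd

# Synthetic data: ~1 million rows of two comma-separated words
# (the vocabulary is an assumption, chosen only for illustration)
words = np.array(['east', 'west', 'north', 'south', 'centre'])
rows = [', '.join(np.random.choice(words, 2, replace=False)) for _ in range(1_000_000)]
big = pd.Series(rows, name='b')

def join_split(s):
    # join into one long string, split once, deduplicate
    return list(set(', '.join(s).split(', ')))

def set_comprehension(s):
    # flatten with a set comprehension
    return list({y for x in s for y in x.split(', ')})

def split_stack(s):
    # pandas vectorised split + stack + unique
    return s.str.split(', ', expand=True).stack().unique().tolist()

for fn in (join_split, set_comprehension, split_stack):
    start = time.perf_counter()
    fn(big)
    print(fn.__name__, round(time.perf_counter() - start, 2), 's')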