How to get unique substrings in pandas column

Question

I have a dataframe as below:

df = pd.DataFrame({'a':[1,2,3,4], 
                   'b':["west, east", "east, north","south, west","east, south"]})

   a            b                                                                                                                     
0  1   west, east                                                                                                                     
1  2  east, north                                                                                                                     
2  3  south, west                                                                                                                     
3  4  east, south

I want to get unique strings from column b like below.

Expected Output:

["east", "west", "north", "south"]   # order doesn't matter here

My Efforts

op = []
for _, value in df['b'].items():
    op.extend(value)

op = set(op)

Which is giving me correct result but is there more efficient way to do so?

My original dataset has around million rows and 1000s of unqiue values.

Answer 1

You can create long string by join , then split and convert to set , last to list :

a = list(set(', '.join(df['b']).split(', ')))
print (a)
['south', 'north', 'west', 'east']

Or use set comprehension with split and flattening:

a = list(set([y for x in df['b'] for y in x.split(', ')]))
print (a)
['south', 'north', 'west', 'east']

Pure pandas solution is use Series.str.split , DataFrame.stack , Series.unique and convert to list :

a = df.b.str.split(', ', expand=True).stack().unique().tolist()

Answer 2

You'd have to profile your code to determine if this is quicker for your particular usecase, but using pandas in-built vectorised methods might show some benefit on larger data sets.

Try using a combination of Series.str.split() and Series.unique().

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html

# Split column of lists into strings
df_split = df['b'].str.rsplit(',', n=-1, expand=True)

# For each column, get unique values and append to set
uniques = set()
for col in df_split:
    uniques.update(df_split[col].unique())

How to get unique substrings in pandas column

Question

2 answers

solution1
5 ACCPTED 2019-04-12 08:10:18

solution2
0 2019-04-12 08:28:54

How to get unique substrings in pandas column

Question

2 answers

solution1 5 ACCPTED 2019-04-12 08:10:18

solution2 0 2019-04-12 08:28:54

solution1
5 ACCPTED 2019-04-12 08:10:18

solution2
0 2019-04-12 08:28:54