简体   繁体   中英

How to get unique substrings in pandas column

I have a dataframe as below:

df = pd.DataFrame({'a':[1,2,3,4], 
                   'b':["west, east", "east, north","south, west","east, south"]})

   a            b                                                                                                                     
0  1   west, east                                                                                                                     
1  2  east, north                                                                                                                     
2  3  south, west                                                                                                                     
3  4  east, south            

I want to get unique strings from column b like below.

Expected Output:

["east", "west", "north", "south"]   # order doesn't matter here

My Efforts

op = []
for _, value in df['b'].items():
    op.extend(value)

op = set(op)

Which is giving me correct result but is there more efficient way to do so?

My original dataset has around million rows and 1000s of unqiue values.

You can create long string by join , then split and convert to set , last to list :

a = list(set(', '.join(df['b']).split(', ')))
print (a)
['south', 'north', 'west', 'east']

Or use set comprehension with split and flattening:

a = list(set([y for x in df['b'] for y in x.split(', ')]))
print (a)
['south', 'north', 'west', 'east']

Pure pandas solution is use Series.str.split , DataFrame.stack , Series.unique and convert to list :

a = df.b.str.split(', ', expand=True).stack().unique().tolist()

You'd have to profile your code to determine if this is quicker for your particular usecase, but using pandas in-built vectorised methods might show some benefit on larger data sets.

Try using a combination of Series.str.split() and Series.unique().

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html

# Split column of lists into strings
df_split = df['b'].str.rsplit(',', n=-1, expand=True)

# For each column, get unique values and append to set
uniques = set()
for col in df_split:
    uniques.update(df_split[col].unique())

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM