
How can I get unique words from a DataFrame column of strings?

I'm looking for a way to get a list of unique words in a column of strings in a DataFrame.

import pandas as pd
import numpy as np

df = pd.read_csv('FinalStemmedSentimentAnalysisDataset.csv', sep=';',dtype= 

tweets = {}
tweets[0] = df[df['sentimentLabel'] == 0]
tweets[1] = df[df['sentimentLabel'] == 1]

The dataset I'm using is from this link: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

I have a column of strings of variable length, and I want to get a list of every unique word in the column together with its count. How can I do that? I'm using pandas in Python, and the original dataset has more than 1M rows, so I also need an efficient way to process it so that the code doesn't run for too long.

An example of the column could be:

  • is so sad for my apl friend.

  • omg this is terrible.

  • what is this new song?

    And the list could be something like.


If the column holds strings, you have to split every sentence into a list of words and then join all the lists into one. You can use sum() for this, which gives you all the words. To get the unique words, convert the result to a set(); later you can convert it back to a list().

But first you have to clean the sentences to remove characters such as . , ? and so on. I use a regex to keep only letters and spaces. You may also want to convert all words to lower or upper case.

import pandas as pd

df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
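        'what is this new song?',
    ]
})

# keep only letters and spaces, lower-case, split into words, then join the per-row lists
unique = set(df['sentences'].str.replace('[^a-zA-Z ]', '', regex=True).str.lower().str.split(' ').sum())

print(sorted(unique))

Result: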

['apl', 'for', 'friend', 'is', 'my', 'new', 'omg', 'sad', 'so', 'song', 'terrible', 'this', 'what']

EDIT: as @HenryYik mentioned in a comment, findall(r'\w+') can be used instead of split(), and it also makes replace() unnecessary.

unique = set(df['sentences'].str.lower().str.findall(r"\w+").sum())
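
To see why findall() can stand in for both steps, here is a small sketch on the same three example sentences (the variable names are only illustrative). It compares the per-row word lists produced by the two approaches:

```python
import pandas as pd

sentences = pd.Series([
    'is so sad for my apl friend.',
    'omg this is terrible.',
    'what is this new song?',
])

# approach 1: strip unwanted characters, then split on spaces
via_split = sentences.str.replace('[^a-zA-Z ]', '', regex=True).str.lower().str.split(' ')

# approach 2: extract runs of word characters directly
via_findall = sentences.str.lower().str.findall(r'\w+')

# both give the same lists for these sentences
print(via_split.tolist() == via_findall.tolist())
```

The two can differ on other input: \w also matches digits and underscores, and split(' ') produces empty strings wherever spaces repeat.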

EDIT: I tested it with data from


Everything runs fast except column.sum() or sum(column). I measured the time for 1000 rows and extrapolated to 1 500 000 rows: it would need 35 minutes.

itertools.chain() is much faster: it would need about 8 seconds.

import itertools

words = df['sentences'].str.lower().str.findall(r"\w+")
# flatten the per-row lists into one list of words
words = list(itertools.chain.from_iterable(words))
unique = set(words)
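
The gap is most likely because summing lists builds a new, longer list at every step, so the work grows roughly quadratically with the number of rows, while chaining visits each word once. A rough sketch on synthetic data follows. The row count and the repeated sentence are made up for illustration, the built-in sum() with an empty-list start value stands in for column.sum(), and the timings will vary by machine:

```python
import itertools
import time

import pandas as pd

# synthetic data: 5000 identical rows, purely for illustration
df = pd.DataFrame({'sentences': ['what is this new song?'] * 5000})
words = df['sentences'].str.lower().str.findall(r'\w+')

start = time.perf_counter()
flat_sum = sum(words, [])  # creates a new list for every row
t_sum = time.perf_counter() - start

start = time.perf_counter()
flat_chain = list(itertools.chain.from_iterable(words))  # single pass over all rows
t_chain = time.perf_counter() - start

print('sum:  ', t_sum, 's')
print('chain:', t_chain, 's')
```

Both variables end up holding the same flat list of words; only the time taken to build it differs.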

But the words can also be added to a set() directly, without building the full list first.

words = df['sentences'].str.lower().str.findall(r"\w+")

unique = set()

for x in words:
    unique.update(x)  # add every word from this row

This takes about 5 seconds.
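
The question also asks for a count per word, which a set cannot hold. One possible extension of the same loop is sketched here with collections.Counter from the standard library, on the three example sentences. It has not been timed on the full dataset:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})

words = df['sentences'].str.lower().str.findall(r'\w+')

counts = Counter()
for x in words:
    counts.update(x)  # adds 1 for every word in this row's list

print(counts.most_common(2))  # [('is', 3), ('this', 2)]
```

The keys of counts are the unique words, so set(counts) gives the same result as the set-based loop.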

Full code:

import pandas as pd
import time 

print(time.strftime('%H:%M:%S'), 'start')


start = time.time()

# `read_csv()` can read directly from a URL, including a zip-compressed file
#url = 'http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip'
url = 'SentimentAnalysisDataset.csv'

# need to skip two rows which are incorrect
df = pd.read_csv(url, sep=',', dtype={'ItemID':int, 'Sentiment':int, 'SentimentSource':str, 'SentimentText':str}, skiprows=[8835, 535881])

end = time.time()
print(time.strftime('%H:%M:%S'), 'load:', end-start, 's')


start = end

words = df['SentimentText'].str.lower().str.findall(r"\w+")
#df['words'] = words

end = time.time()
print(time.strftime('%H:%M:%S'), 'words:', end-start, 's')


start = end

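unique = set()
for x in words:
    unique.update(x)  # add every word from this row

end = time.time()
print(time.strftime('%H:%M:%S'), 'set:', end-start, 's')

# show the first 10 unique words in sorted order
print(sorted(unique)[:10])

Result:
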

00:27:04 start
00:27:08 load: 4.10780930519104 s
00:27:23 words: 14.803470849990845 s
00:27:27 set: 4.338541269302368 s
['0', '00', '000', '0000', '00000', '000000000000', '0000001', '000001', '000014', '00004873337e0033fea60']
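
For comparison, the unique words and their counts can also be obtained without an explicit loop, assuming pandas 0.25 or later, where Series.explode() is available. This is only a sketch on the three example sentences and has not been timed on the full dataset:

```python
import pandas as pd

sentences = pd.Series([
    'is so sad for my apl friend.',
    'omg this is terrible.',
    'what is this new song?',
])

# one row per word, then count how often each word occurs
counts = sentences.str.lower().str.findall(r'\w+').explode().value_counts()

unique = set(counts.index)

print(counts.head(2))
```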

