
How can I get unique words from a DataFrame column of strings?

I'm looking for a way to get a list of unique words in a column of strings in a DataFrame.

import pandas as pd
import numpy as np

df = pd.read_csv('FinalStemmedSentimentAnalysisDataset.csv', sep=';',dtype= 
       {'tweetId':int,'tweetText':str,'tweetDate':str,'sentimentLabel':int})

tweets = {}
tweets[0] = df[df['sentimentLabel'] == 0]
tweets[1] = df[df['sentimentLabel'] == 1]

The dataset I'm using is from this link: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

I have a column of strings of variable length, and I want a list of every unique word in the column together with its count. How can I get that? I'm using pandas in Python, and the original dataset has more than 1M rows, so I also need an efficient approach that doesn't keep the code running for too long.

An example of the column could be:

  • is so sad for my apl friend.

  • omg this is terrible.

  • what is this new song?

And the list could be something like:

[is,so,sad,for,my,apl,friend,omg,this,terrible,what,new,song]

If the column contains strings, you have to split every sentence into a list of words and then join all those lists into one list. You can use sum() for this, which gives you all the words. To get the unique words, convert the result to a set(); you can convert it back to a list() afterwards.

But first you have to clean the sentences to remove characters like ., ?, etc. I use a regex to keep only letters and spaces. You may also have to convert all words to lower or upper case.

import pandas as pd

df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})

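# regex=True is required in recent pandas versions, where str.replace() treats the pattern literally by default
# split() without an argument splits on any whitespace and drops empty strings
unique = set(df['sentences'].str.replace('[^a-zA-Z ]', '', regex=True).str.lower().str.split().sum())

print(sorted(unique))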

Result

['apl', 'for', 'friend', 'is', 'my', 'new', 'omg', 'sad', 'so', 'song', 'terrible', 'this', 'what']

EDIT: as @HenryYik mentioned in a comment, findall(r'\w+') can be used instead of both replace() and split(). Note that \w also matches digits and underscores, not only letters.

unique = set(df['sentences'].str.lower().str.findall(r"\w+").sum())

EDIT: I tested it with data from

http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

Everything runs fast except column.sum() or sum(column, []). I measured the time for 1,000 rows and estimated it for 1,500,000 rows: it would need 35 minutes.
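A rough sketch of why summing lists gets so slow: each `+` builds a brand-new list and copies everything collected so far, so the total work grows roughly with the square of the number of rows. The helper `copies_made_by_sum` below is hypothetical and only emulates what the built-in `sum(lists, [])` does; I'm assuming column.sum() on a column of lists behaves similarly.

```python
def copies_made_by_sum(lists):
    """Emulate sum(lists, []) and count how many elements get copied."""
    total = []
    copied = 0
    for chunk in lists:
        # `total + chunk` allocates a new list and copies both operands
        copied += len(total) + len(chunk)
        total = total + chunk
    return total, copied


rows_1k = [["a", "b"]] * 1000   # 1000 rows with two words each
rows_2k = [["a", "b"]] * 2000   # twice as many rows

flat_1k, copied_1k = copies_made_by_sum(rows_1k)
flat_2k, copied_2k = copies_made_by_sum(rows_2k)

print(len(flat_1k), copied_1k)   # 2000 1001000
print(len(flat_2k), copied_2k)   # 4000 4002000
```

Doubling the number of rows roughly quadruples the number of copied elements, which is why a time measured on 1,000 rows scales so badly to the full dataset.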

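Using itertools.chain.from_iterable() is much faster: it would need about 8 seconds.

import itertools

words = df['sentences'].str.lower().str.findall(r"\w+")
# from_iterable() flattens the per-row lists; plain chain(words) would yield the lists themselves
words = list(itertools.chain.from_iterable(words))
unique = set(words)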

But the words can also be added to a set() directly.

words = df['sentences'].str.lower().str.findall(r"\w+")

unique = set()

for x in words:
    unique.update(x)

This takes about 5 seconds.
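The question also asks for the count of each word, which a set cannot give. A minimal sketch using collections.Counter from the standard library follows the same update-in-a-loop pattern. The `sentences` list here is just the sample data from the question; with the real data you would presumably loop over the text column instead. I have not timed this variant on the full dataset.

```python
import re
from collections import Counter

# sample data from the question; stands in for the DataFrame column
sentences = [
    'is so sad for my apl friend.',
    'omg this is terrible.',
    'what is this new song?',
]

counts = Counter()
for sentence in sentences:
    # lower-case first, then extract runs of word characters
    counts.update(re.findall(r"\w+", sentence.lower()))

unique = sorted(counts)          # the keys are the unique words
print(len(unique))               # 13
print(counts.most_common(2))     # [('is', 3), ('this', 2)]
```

The keys of the Counter are the unique words, so it covers both the list and the counts in one pass.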


Full code:

import pandas as pd
import time 

print(time.strftime('%H:%M:%S'), 'start')

print('-----')
#------------------------------------------------------------------------------

start = time.time()

# `read_csv()` can also read directly from a URL, including a zip-compressed file
#url = 'http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip'
url = 'SentimentAnalysisDataset.csv'

# two malformed rows have to be skipped
df = pd.read_csv(url, sep=',', dtype={'ItemID':int, 'Sentiment':int, 'SentimentSource':str, 'SentimentText':str}, skiprows=[8835, 535881])

end = time.time()
print(time.strftime('%H:%M:%S'), 'load:', end-start, 's')

print('-----')
#------------------------------------------------------------------------------

start = end

words = df['SentimentText'].str.lower().str.findall(r"\w+")
#df['words'] = words

end = time.time()
print(time.strftime('%H:%M:%S'), 'words:', end-start, 's')

print('-----')
#------------------------------------------------------------------------------

start = end

unique = set()
for x in words:
    unique.update(x)

end = time.time()
print(time.strftime('%H:%M:%S'), 'set:', end-start, 's')

print('-----')
#------------------------------------------------------------------------------

print(sorted(unique)[:10])

Result

00:27:04 start
-----
00:27:08 load: 4.10780930519104 s
-----
00:27:23 words: 14.803470849990845 s
-----
00:27:27 set: 4.338541269302368 s
-----
['0', '00', '000', '0000', '00000', '000000000000', '0000001', '000001', '000014', '00004873337e0033fea60']
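For the per-word counts that the question asks for, one option that stays inside pandas is to flatten the lists with explode() and count with value_counts(). This is only a sketch on the three sample sentences, with the column name `sentences` taken from the small example above; I have not measured how it performs on the full 1.5M-row dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})

# one list of words per row
words = df['sentences'].str.lower().str.findall(r"\w+")

# explode() turns every list element into its own row,
# value_counts() then counts how often each word occurs
counts = words.explode().value_counts()

print(counts.head(2))
print(sorted(counts.index))   # the unique words
```

The index of `counts` holds the unique words and the values hold how often each one occurs.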
