
Count individual words in Pandas data frame

I'm trying to count the individual words in a column of my data frame. The column looks like this (in reality the texts are tweets):

text
this is some text that I want to count
That's all I wan't
It is unicode text

From other Stack Overflow questions, I found that I could use the approaches described in the following:

Count most frequent 100 words from sentences in Dataframe Pandas

Count distinct words from a Pandas Data Frame

My df is called result and this is my code:

from collections import Counter
result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
result2

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-6-2f018a9f912d> in <module>()
      1 from collections import Counter
----> 2 result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
      3 result2
TypeError: sequence item 25831: expected str instance, float found

The dtype of text is object, which, as far as I understand, is correct for Unicode text data.

The issue occurs because some of the values in your series (result['text']) are of type float. If you want to include them in ' '.join() as well, you need to convert the floats to strings before passing them to str.join().
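If you want to confirm this before changing anything, a quick check along these lines may help. This is only a sketch: the texts series below is a made-up stand-in for result['text'], and the NaN in it is an assumption about where the floats come from (missing values in an object column are usually stored as float NaN).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for result['text']; NaN plays the role of a missing tweet.
texts = pd.Series(["some text", np.nan, "more text"], dtype=object)

# Count how many values there are of each Python type.
type_counts = texts.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Select the rows whose value is not a str, to inspect them directly.
non_str = texts[~texts.map(lambda v: isinstance(v, str))]
print(non_str)
```

If non_str only contains NaN, the floats are missing values rather than real numbers.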

You can use Series.astype() to convert all the values to strings. Also, you do not need .tolist(); you can pass the series directly to str.join(). Example -

result2 = Counter(" ".join(result['text'].astype(str)).split(" ")).items()

Demo -

In [60]: df = pd.DataFrame([['blah'],['asd'],[10.1]],columns=['A'])

In [61]: df
Out[61]:
      A
0  blah
1   asd
2  10.1

In [62]: ' '.join(df['A'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-62-77e78c2ee142> in <module>()
----> 1 ' '.join(df['A'])

TypeError: sequence item 2: expected str instance, float found

In [63]: ' '.join(df['A'].astype(str))
Out[63]: 'blah asd 10.1'
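One caveat, in case the floats are missing values (NaN) rather than real numbers: depending on the pandas version, converting them with astype(str) may turn each one into the literal text 'nan', which would then be counted as a word. If that is not what you want, a possible variant is to drop the missing values first. The series below is a made-up example, not your data.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical stand-in for result['text']; NaN mimics a missing tweet.
texts = pd.Series(["this is some text", np.nan, "It is unicode text"])

# Drop missing values first so they never reach str.join().
joined = " ".join(texts.dropna().astype(str))

# split() with no argument splits on any run of whitespace.
word_counts = Counter(joined.split())
print(word_counts.most_common(3))
```

Whether to drop or keep those rows depends on whether the non-string values carry any meaning for your count.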

In the end I went with the following code:

pd.set_option('display.max_rows', 100)
words = pd.Series(' '.join(result['text'].astype(str)).lower().split(" ")).value_counts()[:100]
words
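For reference, roughly the same count can probably be written without building one big string, using the pandas string methods and Series.explode() (available in pandas 0.25 and later). This is a sketch on a made-up data frame, and unlike the code above it drops missing values instead of converting them to text.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the result data frame.
result = pd.DataFrame({"text": ["This is some  text", np.nan, "It is unicode text"]})

words = (
    result["text"]
    .dropna()        # skip missing tweets
    .str.lower()     # case-insensitive counting
    .str.split()     # split on any whitespace, so no empty strings
    .explode()       # one word per row
    .value_counts()
    .head(100)       # keep the 100 most frequent words
)
print(words)
```

Note that split(" ") produces empty strings when a text contains consecutive spaces, while split() with no argument does not.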

The problem itself, however, was solved by Anand S Kumar.


 