Counting individual words in a pandas DataFrame

Question

I am trying to count the individual words in one column of my DataFrame. It looks like this. In reality, the texts are tweets.

text
this is some text that I want to count
That's all I wan't
It is unicode text

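From other Stack Overflow questions I found that I could use the approach below:

Count most frequent 100 words from sentences in Dataframe Pandas

Count distinct words from a Pandas Data Frame

My df is called result, and this is my code: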

from collections import Counter
result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
result2

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-6-2f018a9f912d> in <module>()
      1 from collections import Counter
----> 2 result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
      3 result2
TypeError: sequence item 25831: expected str instance, float found

The dtype of text is object, which as far as I know is correct for unicode text data.

Answer 1

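This happens because some of the values in your series (result['text']) are of type float. If you want to include them in the ' '.join() as well, you need to convert the floats to strings before passing them to str.join().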
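To see which entries are not strings, one option is to inspect the Python type of each value. The following is a minimal sketch, not the asker's actual data: the small `result` frame is a made-up stand-in, and it assumes the float is a missing tweet stored as `NaN`, which is a common cause but not something the question confirms.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's `result` frame.
# The NaN row mimics a missing tweet (an assumption, not confirmed by the question).
result = pd.DataFrame(
    {"text": pd.Series(["this is some text", np.nan, "It is unicode text"], dtype=object)}
)

# Count how many values of each Python type the column holds.
type_counts = result["text"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Select the rows whose value is not a str; these are what str.join() rejects.
non_strings = result[~result["text"].map(lambda v: isinstance(v, str))]
print(non_strings)
```

An object column can hold any mix of Python objects, so a dtype of object alone does not guarantee that every value is a string.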

You can use Series.astype() to convert all the values to strings. Also, you don't actually need .tolist(); you can pass the series directly to str.join(). Example:

result2 = Counter(" ".join(result['text'].astype(str)).split(" ")).items()

Demo:

In [60]: df = pd.DataFrame([['blah'],['asd'],[10.1]],columns=['A'])

In [61]: df
Out[61]:
      A
0  blah
1   asd
2  10.1

In [62]: ' '.join(df['A'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-62-77e78c2ee142> in <module>()
----> 1 ' '.join(df['A'])

TypeError: sequence item 2: expected str instance, float found

In [63]: ' '.join(df['A'].astype(str))
Out[63]: 'blah asd 10.1'
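One thing to watch for when the floats are missing values: calling `str()` on `NaN` yields the literal text `nan`, so a conversion-based fix can add a spurious `nan` token to the word counts (whether `astype(str)` does exactly this may depend on the pandas version). The sketch below contrasts an element-wise `str()` conversion with dropping the missing rows first. The data is made up for illustration, and `Series.map(str)` is used here in place of `astype(str)` to keep the behaviour explicit.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical data: one missing value, stored as float NaN.
result = pd.DataFrame(
    {"text": pd.Series(["this is some text", np.nan, "It is text"], dtype=object)}
)

# Converting every value with str() keeps the row and turns NaN into the word "nan".
with_nan = Counter(" ".join(result["text"].map(str)).split())

# dropna() removes the missing rows before joining, so no "nan" token appears.
without_nan = Counter(" ".join(result["text"].dropna()).split())

print(with_nan["nan"], without_nan["nan"])
```

Which variant is appropriate depends on whether the non-string values carry meaning (like the 10.1 in the demo) or are simply gaps in the data.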

Answer 2

In the end, I went with the following code:

pd.set_option('display.max_rows', 100)
words = pd.Series(' '.join(result['text'].astype(str)).lower().split(" ")).value_counts()[:100]
words

The problem itself, however, was solved by Anand S Kumar.
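A caveat with `split(" ")`: consecutive spaces produce empty strings, which then get counted as words. The following is a sketch of a variant built on pandas string methods that splits on any run of whitespace instead. It assumes `Series.explode` is available (pandas 0.25 or later), and the sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a double space and a missing value.
result = pd.DataFrame(
    {"text": pd.Series(["This is  some text", np.nan, "this text"], dtype=object)}
)

words = (
    result["text"]
    .dropna()        # skip missing tweets instead of converting them
    .str.lower()     # make the count case-insensitive
    .str.split()     # no argument: split on any run of whitespace
    .explode()       # one row per word
    .value_counts()  # sorted by frequency, most common first
    .head(100)       # keep the 100 most common words
)
print(words)
```

This keeps the whole pipeline inside pandas and avoids building one large joined string, which may matter for a large collection of tweets.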

Solution 1
7 Accepted 2015-10-20 16:27:25

Solution 2
2 2015-10-20 17:03:33
