Count individual words in Pandas data frame
I'm trying to count the individual words in a column of my data frame. In reality the texts are Tweets. It looks like this:
text
this is some text that I want to count
That's all I wan't
It is unicode text
From other Stack Overflow questions, I found approaches I could use:
Count most frequent 100 words from sentences in Dataframe Pandas
Count distinct words from a Pandas Data Frame
My df is called result and this is my code:
from collections import Counter
result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
result2
I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-6-2f018a9f912d> in <module>()
1 from collections import Counter
----> 2 result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
3 result2
TypeError: sequence item 25831: expected str instance, float found
The dtype of text is object, which from what I understand is correct for unicode text data.
The issue occurs because some of the values in your series (result['text']) are of type float. If you want to include them in ' '.join() as well, you need to convert the floats to strings before passing them to str.join().
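An object column can hold any Python object, so the dtype alone does not guarantee that every value is a string. The following is a minimal sketch for locating the non-string values before joining. The sample frame is hypothetical and only stands in for the asker's result.

```python
import pandas as pd

# Hypothetical sample data: an object column with one float mixed in.
result = pd.DataFrame({"text": ["some text", 10.1, "more text"]})

# True for every value that is not a str.
not_str = ~result["text"].map(lambda v: isinstance(v, str))

# Show the offending rows together with their index labels.
print(result.loc[not_str, "text"])
```

The index labels printed here should correspond to positions like the "sequence item 25831" reported in the traceback, assuming the frame uses the default integer index.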
You can use Series.astype() to convert all the values to strings. Also, you do not need .tolist(); you can pass the series to str.join() directly. Example:
result2 = Counter(" ".join(result['text'].astype(str)).split(" ")).items()
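One caveat, stated as an assumption about the data: if the floats are missing values (NaN) rather than real numbers, converting them to strings may add the literal word "nan" to the counts. A possible sketch that drops missing rows first is shown below; the sample frame is hypothetical, and whether dropping is appropriate depends on the data.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's `result` frame; np.nan models a missing tweet.
result = pd.DataFrame({"text": ["this is some text", np.nan, "It is unicode text"]})

# Drop missing rows first, then convert whatever remains to str.
texts = result["text"].dropna().astype(str)

# split() with no argument splits on any run of whitespace.
counts = Counter(" ".join(texts).split())
print(counts.most_common(3))
```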
Demo:
In [60]: df = pd.DataFrame([['blah'],['asd'],[10.1]],columns=['A'])
In [61]: df
Out[61]:
A
0 blah
1 asd
2 10.1
In [62]: ' '.join(df['A'])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-62-77e78c2ee142> in <module>()
----> 1 ' '.join(df['A'])
TypeError: sequence item 2: expected str instance, float found
In [63]: ' '.join(df['A'].astype(str))
Out[63]: 'blah asd 10.1'
In the end I went with the following code:
pd.set_option('display.max_rows', 100)
words = pd.Series(' '.join(result['text'].astype(str)).lower().split(" ")).value_counts()[:100]
words
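A variation on the code above, offered as a sketch rather than as the asker's method: split(" ") produces empty strings wherever two spaces appear in a row, and those empty strings are then counted as words. The version below uses the pandas string methods and Series.explode() instead of joining everything into one string. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's `result` frame.
result = pd.DataFrame({"text": ["this is some text", "It is  unicode text", 10.1]})

words = (
    result["text"]
    .astype(str)      # make non-string values usable, as in the answer
    .str.lower()
    .str.split()      # no argument: split on any run of whitespace
    .explode()        # one row per word
    .value_counts()
    .head(100)        # keep the 100 most frequent words
)
print(words)
```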
The problem itself, however, was solved by Anand S Kumar.