Python & Pandas: 'Series' objects are mutable, thus they cannot be hashed
With Python and Pandas, I'm trying to write a script that takes the data from the text column, evaluates that text with the textstat module, and then writes the results back into the csv under the word_count column.
Here is the structure of the csv:
   user_id         text text_number  word_count
0       10  test text A      text_0         NaN
1       11          NaN         NaN         NaN
2       12          NaN         NaN         NaN
3       13          NaN         NaN         NaN
4       14          NaN         NaN         NaN
5       15  test text B      text_1         NaN
Here is my attempt to loop the text column through textstat:
df = pd.read_csv("texts.csv").fillna('')
text_data = df["text"]
length1 = len(text_data)
for x in range(length1):
    (text_data[x])
#this is the textstat word count operation
word_count = textstat.lexicon_count(text_data, removepunct=True)
output_df = pd.DataFrame({"word_count":[word_count]})
output_df.to_csv('texts.csv', mode="a", header=False, index=False)
However, I receive this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Any suggestions on how to proceed? All assistance appreciated.
The more idiomatic pandas approach would be to use fillna + apply, then write the resulting Series out directly with to_csv:
(
    df["text"].fillna('')  # Replace NaN with empty string
    .apply(textstat.lexicon_count,
           removepunct=True)  # Call lexicon_count on each value
    .rename('word_count')  # Rename Series
    .to_csv('texts.csv', mode="a", index=False)  # Write to csv
)
texts.csv:
word_count
1
0
0
0
0
1
To add a column to the existing DataFrame/csv instead of appending to the end of the file, you can also do:
df['word_count'] = (
    df["text"].fillna('')  # Replace NaN with empty string
    .apply(textstat.lexicon_count,
           removepunct=True)  # Call lexicon_count on each value
)
df.to_csv('texts.csv', index=False) # Write to csv
texts.csv:
user_id,text,text_number,word_count
text,A,text_0,1
,,,0
,,,0
,,,0
,,,0
text,B,text_1,1
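As a rough end-to-end sketch of the same column-assignment idea on the question's own sample rows: the block below reads an in-memory CSV rather than texts.csv and substitutes a hypothetical count_words helper for textstat.lexicon_count, so it runs without textstat installed. Both the helper and the inline CSV are assumptions made for illustration, and the whitespace split only approximates what lexicon_count does.

```python
import io

import pandas as pd


def count_words(text):
    """Hypothetical stand-in for textstat.lexicon_count: whitespace split only."""
    return len(text.split())


# Assumed sample data mirroring three rows of the question's csv.
csv_in = io.StringIO(
    "user_id,text,text_number,word_count\n"
    "10,test text A,text_0,\n"
    "11,,,\n"
    "15,test text B,text_1,\n"
)

df = pd.read_csv(csv_in)

# Replace NaN with '' so every value is a string, then count per row.
df["word_count"] = df["text"].fillna("").apply(count_words)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_out = df.to_csv(index=False)
print(csv_out)
```

Because the whole DataFrame is rewritten in one call, there is no need for mode="a" or for managing the header by hand.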
To fix the current implementation, also use fillna, pass each value (not the whole Series) to lexicon_count inside the loop, and conditionally write the header only on the first iteration:
text_data = df["text"].fillna('')
for i, x in enumerate(text_data):
    # this is the textstat word count operation
    word_count = textstat.lexicon_count(x, removepunct=True)
    output_df = pd.DataFrame({"word_count": [word_count]})
    output_df.to_csv('texts.csv', mode="a", header=(i == 0), index=False)
texts.csv:
word_count
1
0
0
0
0
1
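As for why the original loop raised the TypeError: it passed the entire text_data Series to lexicon_count instead of a single string. A likely explanation, stated here as an assumption and not something confirmed in this thread, is that textstat caches its results, and a cache has to hash its arguments. The sketch below reproduces that situation with a hypothetical count_words function wrapped in functools.lru_cache; it is not textstat's actual code.

```python
from functools import lru_cache

import pandas as pd


@lru_cache(maxsize=128)
def count_words(text):
    """Hypothetical cached stand-in for textstat.lexicon_count."""
    return len(text.split())


texts = pd.Series(["test text A", "", "test text B"])

# Passing the whole Series: the cache tries to hash it, which pandas refuses.
try:
    count_words(texts)
    whole_series_error = None
except TypeError as exc:
    whole_series_error = exc

# Passing one string at a time works, because str is hashable.
per_row_counts = [count_words(t) for t in texts]

print(type(whole_series_error).__name__)
print(per_row_counts)
```

The exact wording of the TypeError message depends on the installed pandas version, so the sketch only checks the exception type.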
DataFrame and imports:
import pandas as pd
import textstat
from numpy import nan
df = pd.DataFrame({
    'user_id': ['text', nan, nan, nan, nan, 'text'],
    'text': ['A', nan, nan, nan, nan, 'B'],
    'text_number': ['text_0', nan, nan, nan, nan, 'text_1'],
    'word_count': [nan, nan, nan, nan, nan, nan]
})