How to handle NA values when tokenizing the contents of a data frame?
I have a pandas dataframe and I am trying to tokenize the contents of each row.
import pandas as pd
import nltk as nk
from nltk import word_tokenize
TextData = pd.read_csv('TextData.csv')
TextData['tokenized_summary'] = TextData.apply(lambda row: nk.word_tokenize(row['Summary']), axis=1)
When I run it, I get an error at index 67:
TypeError: ('expected string or buffer', u'occurred at index 67')
I think I am getting this because the value for 'Summary' at iloc[67] is an NA value.
TextData.Summary.iloc[67]
Out[45]: nan
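As a side note, since NaN compares unequal to itself, an equality check would not catch this; pd.isnull is the reliable way to confirm a missing value (a minimal check, assuming the same DataFrame):
import pandas as pd
# NaN != NaN, so comparing with == would silently fail;
# pd.isnull() returns True for missing values.
pd.isnull(TextData.Summary.iloc[67])  # True if the value is missing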
Assuming it is the NA value that is causing this, is there a way to tell word_tokenize or pandas to ignore NA values whenever they come across them?
Otherwise, what else might be causing this?
You can use fillna() to replace NaN with a specified value:
import pandas as pd
import nltk as nk
from nltk import word_tokenize
TextData = pd.read_csv('TextData.csv')
TextData['Summary'] = TextData['Summary'].fillna('some value')  # fillna returns a copy, so assign the result back
TextData['tokenized_summary'] = TextData.apply(lambda row: nk.word_tokenize(row['Summary']), axis=1)
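If the goal is simply to avoid the TypeError, one option (an assumption here, not part of the original answer) is to fill with an empty string, since word_tokenize('') returns an empty token list:
TextData['Summary'] = TextData['Summary'].fillna('')  # empty string tokenizes to []
TextData['tokenized_summary'] = TextData['Summary'].apply(nk.word_tokenize)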
Alternatively, you can simply drop the records where that value is null:
TextData = TextData[TextData['Summary'].notnull()]
Making the final product look like:
import pandas as pd
import nltk as nk
from nltk import word_tokenize
TextData = pd.read_csv('TextData.csv')
TextData = TextData[TextData['Summary'].notnull()]  # drop rows with a missing Summary before tokenizing
TextData['tokenized_summary'] = TextData.apply(lambda row: nk.word_tokenize(row['Summary']), axis=1)
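If you would rather keep the NaN rows in the DataFrame but skip them during tokenization, a minimal sketch (assuming the same 'Summary' column) is to guard the call inside apply:
import pandas as pd
from nltk import word_tokenize
TextData = pd.read_csv('TextData.csv')
# Tokenize only string values; leave NaN cells untouched.
TextData['tokenized_summary'] = TextData['Summary'].apply(
    lambda s: word_tokenize(s) if isinstance(s, str) else s)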