
How to handle NA values when tokenizing the contents of a data frame?

I have a pandas dataframe and I am trying to tokenize the contents of each row.

import pandas as pd
import nltk as nk
from nltk import word_tokenize

TextData = pd.read_csv('TextData.csv')
TextData['tokenized_summary'] = TextData.apply(lambda row: nk.word_tokenize(row['Summary']), axis=1)

When I run it, I get an error at index 67:

TypeError: ('expected string or buffer', u'occurred at index 67') 

I think I am getting this because the value of 'Summary' at iloc[67] is an NA value.

TextData.Summary.iloc[67]

Out[45]: nan

Assuming it is the NA value that is causing this, is there a way to tell word_tokenize or pandas to ignore NA values whenever it comes across them?

If not, what else might be causing this?
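One way to check whether NA is the only culprit is to list every row whose 'Summary' is not a string, since a tokenizer that expects text will reject numbers as well as NaN. This is a minimal sketch on made-up sample data; the `sample` frame and its values are assumptions, not the contents of TextData.csv.

```python
import pandas as pd

# Hypothetical sample data standing in for TextData.csv
sample = pd.DataFrame({'Summary': ['some text', float('nan'), 42]})

# True for every row whose Summary is not a str (NaN is a float, 42 is an int)
not_text = ~sample['Summary'].map(lambda v: isinstance(v, str))

# Rows that a string-only tokenizer would reject
bad_rows = sample[not_text]
print(bad_rows)
```

If `bad_rows` contains anything other than NaN entries, the column holds mixed types and filling or dropping NA alone will not be enough.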

You can use fillna() to replace NaN with a specified value:

import pandas as pd
import nltk as nk
from nltk import word_tokenize

TextData = pd.read_csv('TextData.csv')
TextData['Summary'] = TextData['Summary'].fillna('some value')  # fillna returns a new object, so assign the result back
TextData['tokenized_summary'] = TextData.apply(lambda row: nk.word_tokenize(row['Summary']), axis=1)
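If the NA rows should be kept but skipped rather than filled with placeholder text, another option is to wrap the tokenizer so that non-string values yield an empty list. The sketch below is only an illustration: `tokenize_or_empty` is a hypothetical helper, the sample frame is made up, and `str.split` stands in for `nk.word_tokenize` so the example runs without NLTK data.

```python
import pandas as pd

def tokenize_or_empty(value, tokenizer=str.split):
    """Tokenize strings; return an empty list for NaN or any other non-string."""
    if isinstance(value, str):
        return tokenizer(value)
    return []

# Hypothetical sample data standing in for TextData.csv
TextData = pd.DataFrame({'Summary': ['first summary here', float('nan'), 'another one']})

# Apply to the single column instead of row-wise over the whole frame
TextData['tokenized_summary'] = TextData['Summary'].apply(tokenize_or_empty)
print(TextData)
```

To use NLTK instead, pass it in explicitly, for example `TextData['Summary'].apply(tokenize_or_empty, tokenizer=nk.word_tokenize)`.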

Previous Answer

You can simply "eliminate" the records where that value is null:

TextData = TextData[TextData['Summary'].notnull()]  # filter on 'Summary'; 'tokenized_summary' does not exist yet

Making the final product look like:

import pandas as pd
import nltk as nk
from nltk import word_tokenize

TextData = pd.read_csv('TextData.csv')
TextData = TextData[TextData['Summary'].notnull()]  # drop rows whose 'Summary' is NA before tokenizing
TextData['tokenized_summary'] = TextData.apply(lambda row: nk.word_tokenize(row['Summary']), axis=1)
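The same row-dropping idea can also be written with dropna(subset=...). This is a sketch on made-up data, with a plain whitespace split standing in for `nk.word_tokenize` (an assumption, so that it runs without NLTK data); the names `sample`, `cleaned` and `split_words` are hypothetical.

```python
import pandas as pd

def split_words(text):
    """Stand-in tokenizer: split on whitespace."""
    return text.split()

# Hypothetical sample data standing in for TextData.csv
sample = pd.DataFrame({'Summary': ['a b', None, 'c d e']})

# Drop rows where Summary is NA; copy so the new column is added to an independent frame
cleaned = sample.dropna(subset=['Summary']).copy()
cleaned['tokenized_summary'] = cleaned['Summary'].apply(split_words)
print(cleaned)
```

Note that the surviving rows keep their original index labels, so call reset_index(drop=True) afterwards if a contiguous index is needed.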


