How to replace all non-numeric entries with NaN in a pandas dataframe?
I have various csv files and I import them as a DataFrame. The problem is that many files use different symbols for missing values. Some use nan, others NaN, ND, None, missing etc., or just leave the entry empty. Is there a way to replace all these values with np.nan? In other words, any non-numeric value in the dataframe becomes np.nan. Thank you for the help.
I found what I think is a relatively elegant but also robust method:
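def isnumber(x):
    # True only if x can be converted to a float
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# Cells where the mask is False become NaN.
# (DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map.)
df[df.applymap(isnumber)]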
In case it's not clear: You define a function that returns True only if whatever input you have can be converted to a float. You then filter df with that boolean dataframe, which automatically assigns NaN to the cells you didn't filter for.
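To make the masking step concrete, here is a minimal sketch on made-up data. The sample values are assumptions chosen for illustration, and the mask is built column by column with Series.map rather than applymap so that it does not depend on the pandas version.

```python
import pandas as pd

def isnumber(x):
    """Return True only if x can be converted to a float."""
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# Hypothetical sample data mixing numbers, numeric strings and placeholders
df = pd.DataFrame({
    "a": [1, "ND", "3.5"],
    "b": ["missing", 2.0, None],
})

# Build the boolean mask column by column (Series.map calls isnumber per cell)
mask = df.apply(lambda col: col.map(isnumber))

# Indexing with a boolean DataFrame keeps the True cells and puts NaN elsewhere
filtered = df[mask]
print(filtered)
```

Note that the string "3.5" survives the filter but is still a string at this point.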
Another solution I tried was to define isnumber as
import numbers

def isnumber(x):
    # True only for real numeric objects, not for numeric strings
    return isinstance(x, numbers.Number)
but what I liked less about that approach is that you can accidentally have a number as a string, so you would mistakenly filter those out. This is also a sneaky error, seeing that the dataframe displays the string "99" the same as the number 99.
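The difference between the two checks can be seen on a single value. This is only a small illustration; the two function names are hypothetical and were picked here to tell the variants apart.

```python
import numbers

def isnumber_by_type(x):
    # Type-based check: only real numeric objects pass
    return isinstance(x, numbers.Number)

def isnumber_by_conversion(x):
    # Conversion-based check: anything float() accepts passes
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

value = "99"  # a number stored as a string
print(isnumber_by_type(value))        # False
print(isnumber_by_conversion(value))  # True
```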
EDIT:
In your case you probably still need to df = df.applymap(float) after filtering, for the reason that float works on all different capitalizations of 'nan', but until you explicitly convert them they will still be considered strings in the dataframe.
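A sketch of the filter-then-convert sequence on made-up data follows. The sample values are assumptions. The last line shows pd.to_numeric with errors="coerce", which is a different technique from the one described in this answer but is often used for the same purpose.

```python
import pandas as pd

def isnumber(x):
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# Hypothetical data: "NaN" and "nan" are strings that float() accepts
df = pd.DataFrame({
    "a": [1, "NaN", "3.5", "ND"],
    "b": ["nan", 2.0, None, "7"],
})

# Step 1: filter. "ND" and None become NaN, but "NaN", "nan", "3.5" and "7"
# are still strings at this point.
filtered = df[df.apply(lambda col: col.map(isnumber))]

# Step 2: convert every remaining cell to float
converted = filtered.apply(lambda col: col.map(float))
print(converted.dtypes)

# Alternative (not the method above): coerce anything unparseable to NaN
coerced = df.apply(pd.to_numeric, errors="coerce")
```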
import numpy as np
import pandas as pd

# Create a custom list of values I want to cast to NaN, and explicitly
# define the data types of columns:
na_values = ['None', '(S)', 'S']
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctapi': np.float64}, na_values=na_values)
I believe the best practice when working with messy data is to:
This is quite easy to do.
Pandas read_csv has a list of values that it looks for and automatically casts to NaN when parsing the data (see the documentation of read_csv for the list). You can extend this list using the na_values parameter, and you can tell pandas how to cast particular columns using the dtype parameter.
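As a self-contained illustration of the two parameters, the sketch below parses a small in-memory CSV instead of a file. The rows are invented; only the column name pctapi and the marker (S) come from this answer.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content; "NaN" is already on pandas' default NA list,
# while "(S)" and "ND" are not.
csv_text = "name,pctapi\nSMITH,0.5\nJONES,(S)\nLEE,ND\nKIM,NaN\n"

last_names = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"pctapi": np.float64},   # force the column type
    na_values=["(S)", "ND"],        # extra markers to treat as NaN
)
print(last_names.dtypes)
print(last_names)
```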
In the example above, pctapi is the name of a column that was being cast to object type instead of float64, due to missing-value markers that pandas did not recognize as NaN. So, I force pandas to cast to float64 and provide the read_csv function with a list of values to cast to NaN.
Since data science is often completely about process, I thought I'd describe the steps I use to create an na_values list and debug this issue with a dataset.
In the example above, Pandas was right on about half the columns. However, I expected all columns listed below the 'count' field to be of type float64. We'll need to fix this.
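One way to spot the columns that were not inferred as numeric is to inspect the dtypes after a plain read. This is a sketch on invented rows, and the variable name suspect is hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV: pctwhite contains the marker "(S)", so it cannot be
# inferred as a float column
csv_text = "name,count,pctwhite\nSMITH,10,70.9\nJONES,20,(S)\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)

# Columns that pandas did not infer as numeric
suspect = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
print(suspect)
```

Here name is expected to be text, so only pctwhite needs attention.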
# note: the dtype dictionary specifies column types. pandas will attempt to infer
# the type of any column name that's not listed
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctwhite': np.float64})
Here's the error message I receive when running the code above:
From the error message, I can see that pandas was unable to cast the value of (S). I add this to my list of na_values:
# note the new na_values argument provided to read_csv
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctwhite': np.float64}, na_values=['(S)'])
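The fail-then-fix cycle can be reproduced on a small in-memory CSV. This is a sketch with invented rows, and it assumes that the failed cast surfaces as a ValueError, which is what the default parser raises for an unparseable float.

```python
import io

import numpy as np
import pandas as pd

csv_text = "name,pctwhite\nSMITH,70.9\nJONES,(S)\n"

# Without na_values, forcing float64 fails on the "(S)" marker
failed = False
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"pctwhite": np.float64})
except ValueError as err:
    failed = True
    print("cast failed:", err)

# With the marker added to na_values, the same cast succeeds
fixed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"pctwhite": np.float64},
    na_values=["(S)"],
)
print(fixed.dtypes)
```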
Finally, I repeat steps 2 & 3 until I have a comprehensive list of dtype mappings and na_values.
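Instead of discovering one marker per failed run, the candidates can also be collected in one pass. The sketch below is an alternative to the manual loop, not part of the original steps: it reads everything as text and lists the values that do not parse as numbers. The helper name non_numeric_values and the sample rows are hypothetical.

```python
import io

import pandas as pd

csv_text = (
    "name,pctwhite,pctapi\n"
    "SMITH,70.9,0.5\n"
    "JONES,(S),ND\n"
    "LEE,45.1,(S)\n"
    "KIM,ND,1.2\n"
)

# Read every cell as text and keep the raw markers instead of converting them
raw = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

def non_numeric_values(column):
    """Return the distinct values in a text column that do not parse as numbers."""
    numeric = pd.to_numeric(column, errors="coerce")
    return sorted(column[numeric.isna()].unique())

candidates = {name: non_numeric_values(raw[name]) for name in ["pctwhite", "pctapi"]}
print(candidates)
```

Because keep_default_na=False is set, markers that pandas already handles (such as NaN) would show up in the result too, so the list still needs a manual review before it is passed to na_values.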
If you're working on a hobbyist project this method may be more than you need; you may want to use u/instant's answer instead. However, if you're working in production systems or on a team, it's well worth the 10 minutes it takes to correctly cast your columns.