
How to replace all non-numeric entries with NaN in a pandas dataframe?

I have various csv files and I import them as a DataFrame. The problem is that many files use different symbols for missing values. Some use nan, others NaN, ND, None, missing etc., or just leave the entry empty. Is there a way to replace all these values with np.nan? In other words, any non-numeric value in the dataframe becomes np.nan. Thank you for the help.

I found what I think is a relatively elegant but also robust method:

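def isnumber(x):
    # True only if x can be converted to a float
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# Keep the cells that pass the check; every other cell becomes NaN.
# (From pandas 2.1 on, DataFrame.map replaces the deprecated applymap.)
df[df.applymap(isnumber)]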

In case it's not clear: You define a function that returns True only if whatever input you have can be converted to a float. You then filter df with that boolean dataframe, which automatically assigns NaN to the cells you didn't filter for.
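As a minimal sketch of that filtering step, here is a small made-up dataframe (the column names and values are illustrative assumptions, not from the question). It applies the check column by column with Series.map instead of applymap, which is an equivalent substitution that avoids the deprecation warning on newer pandas versions:

```python
import pandas as pd

def isnumber(x):
    # True only if x can be converted to a float
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# Hypothetical messy data with several missing-value markers
df = pd.DataFrame({"a": [1, "ND", "3.5"], "b": ["missing", 2.0, None]})

# Boolean dataframe: True where the cell can be converted to a float
mask = df.apply(lambda col: col.map(isnumber))

# Indexing with the boolean dataframe keeps the True cells, NaN elsewhere
filtered = df[mask]
print(filtered)
```

Note that the string "3.5" survives the filter unchanged, so it is still a string at this point.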

Another solution I tried was to define isnumber as

import numbers

def isnumber(x):
    # True only for real numeric types; numeric strings such as "99" fail
    return isinstance(x, numbers.Number)

but what I liked less about that approach is that you can accidentally have a number as a string, so you would mistakenly filter those out. This is also a sneaky error, seeing that the dataframe displays the string "99" the same as the number 99.
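To make the difference between the two checks concrete, here is a small comparison on a few sample values. The name isnumber_strict and the sample list are only for illustration:

```python
import numbers

def isnumber_strict(x):
    # Type check only: strings are rejected even when they look numeric
    return isinstance(x, numbers.Number)

def isnumber(x):
    # Conversion check: anything float() accepts counts as a number
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

values = [99, "99", 3.5, "ND"]
strict = [isnumber_strict(v) for v in values]
lenient = [isnumber(v) for v in values]
print(strict)   # the string "99" is rejected here
print(lenient)  # the string "99" is accepted here
```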

EDIT:

In your case you probably still need to run df = df.applymap(float) after filtering, because float works on all the different capitalizations of 'nan', but until you explicitly convert them they will still be considered strings in the dataframe.
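A short sketch of the filter followed by the float conversion, again on made-up data and using column-wise Series.map in place of applymap. The last line shows pd.to_numeric with errors="coerce", which is a different technique from the one described here but should give the same result in a single step:

```python
import pandas as pd

def isnumber(x):
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# Hypothetical data: numeric strings, 'nan' spellings, and junk markers
df = pd.DataFrame({"a": ["99", "nan", "ND"], "b": [1.5, "NaN", None]}, dtype=object)

mask = df.apply(lambda col: col.map(isnumber))

# After filtering, 'nan' and 'NaN' are still strings; float() turns them into real NaN
as_float = df[mask].apply(lambda col: col.map(float))
print(as_float.dtypes)

# Alternative (not the method above): coerce every unparseable cell to NaN directly
coerced = df.apply(pd.to_numeric, errors="coerce")
```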

Replacing non-numeric entries on read, the easier (and safer) way

TL;DR: Set a datatype for the column(s) that aren't casting properly, and supply a list of na_values

import numpy as np
import pandas as pd

# Create a custom list of values I want to cast to NaN, and explicitly
#   define the data types of columns:
na_values = ['None', '(S)', 'S']
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctapi': np.float64}, na_values=na_values)

Longer Explanation

I believe the best practice when working with messy data is to:

  • Provide datatypes to pandas for columns whose datatypes are not inferred properly.
  • Explicitly define a list of values that should be cast to NaN.

This is quite easy to do.

Pandas read_csv has a list of values that it looks for and automatically casts to NaN when parsing the data (see the documentation of read_csv for the list). You can extend this list using the na_values parameter, and you can tell pandas how to cast particular columns using the dtype parameter.

In the example above, pctapi is the name of a column that was being cast to object type instead of float64, due to non-standard NaN values. So, I force pandas to cast to float64 and provide the read_csv function with a list of values to cast to NaN.
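Here is a self-contained sketch of the same call that runs without the census file. The in-memory CSV text and its rows are invented for illustration; only the pctapi column name and the na_values list come from the example above:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real file: three different missing markers
csv_text = "name,pctapi\nSMITH,0.5\nLEE,(S)\nKIM,None\nCRUZ,\n"

na_values = ["None", "(S)", "S"]
last_names = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"pctapi": np.float64},  # force the column to float64
    na_values=na_values,           # extra strings to treat as NaN
)
print(last_names)
```

The empty field is already on the default NaN list of read_csv, so only the custom markers need to be supplied.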

Process I follow

Since data science is often completely about process, I thought I'd describe the steps I use to create an na_values list and debug this issue with a dataset.

Step 1: Try to import the data and let pandas infer data types. Check if the data types are as expected. If they are, move on.

[image]

In the example above, Pandas was right on about half the columns. However, I expected all columns listed below the 'count' field to be of type float64. We'll need to fix this.
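Step 1 can be sketched on a tiny in-memory CSV. The rows are invented; the column names follow the ones mentioned in this answer:

```python
import io

import pandas as pd

# Hypothetical sample: one '(S)' marker hides in an otherwise numeric column
csv_text = "name,count,pctwhite\nSMITH,100,70.9\nLEE,50,(S)\n"

df = pd.read_csv(io.StringIO(csv_text))

# Inspect what pandas inferred; pctwhite does not come out as float64
print(df.dtypes)
```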

Step 2: If data types are not as expected, explicitly set the data types on read using the dtype parameter. This will throw errors by default on values that cannot be cast.

# note the dtype dictionary specifying types. pandas will attempt to infer
#   the type of any column that's not listed
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctwhite': np.float64})

Here's the error message I receive when running the code above: [image]
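To reproduce this kind of failure without the original file, the following sketch forces float64 on a made-up column that contains (S) and captures the resulting exception. It assumes the default parser, which raises a ValueError in this situation:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample with one value that cannot be cast to float
csv_text = "name,pctwhite\nSMITH,70.9\nLEE,(S)\n"

try:
    pd.read_csv(io.StringIO(csv_text), dtype={"pctwhite": np.float64})
    error_message = None
except ValueError as exc:
    # The message names the value pandas could not convert
    error_message = str(exc)

print(error_message)
```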

Step 3: Create an explicit list of values pandas cannot convert and cast them to NaN on read.

From the error message, I can see that pandas was unable to cast the value of (S). I add this to my list of na_values:

# note the new na_values argument provided to read_csv
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctwhite': np.float64}, na_values=['(S)'])

Finally, I repeat steps 2 & 3 until I have a comprehensive list of dtype mappings and na_values.
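If the repeat loop gets tedious, one possible shortcut is to read the column as text once and list every value that does not parse as a number. This is not part of the process above; find_unparseable is a hypothetical helper and the CSV text is invented:

```python
import io

import numpy as np
import pandas as pd

def find_unparseable(series):
    """Return the distinct non-missing values that do not parse as numbers."""
    as_text = series.dropna().astype(str)
    bad = pd.to_numeric(as_text, errors="coerce").isna()
    return sorted(as_text[bad].unique())

# Hypothetical sample with two different junk markers and one empty field
csv_text = "name,pctwhite\nSMITH,70.9\nLEE,(S)\nKIM,ND\nCRUZ,\nDIAZ,(S)\n"

# Read everything as text so nothing fails, then collect the offenders
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)
na_values = find_unparseable(raw["pctwhite"])

# Re-read with the explicit dtype and the collected na_values
clean = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"pctwhite": np.float64},
    na_values=na_values,
)
print(na_values)
```

It is still worth reviewing the collected list by hand, since a typo in real data would be swept into NaN along with the genuine missing markers.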

If you're working on a hobbyist project, this method may be more than you need; you may want to use u/instant's answer instead. However, if you're working in production systems or on a team, it's well worth the 10 minutes it takes to correctly cast your columns.
