Pandas Dataframe - Replacing None-like Values with None in All Columns

I need to clean up a dataframe whose columns come from different sources and have different types. This means that I can have, for example, string columns that contain "nan", "none", or "NULL" (as strings instead of a None value).

My goal is to find all empty values and replace them with None. This works fine:

for column in df.columns:
    for idx, row in df.iterrows():
        if (str(row[column]).lower() == "none") or (str(row[column]).lower() == "nan") or (str(row[column]).lower() == "null"):
            df.at[row.name, column] = None

But it is obviously not the best or fastest way to do it. How can I take advantage of Pandas operations or list comprehensions to do this substitution? Thanks!

This seems to be a somewhat controversial topic (see, e.g., this thread), but it's often said that list comprehensions are more computationally efficient than for loops, especially when iterating over pandas dataframes.

I also prefer list comprehensions stylistically, as they lead to fewer levels of indentation than nested loops and if statements.

Here's what it looks like for your use case:

for column in df.columns:
    vals_list = df[column].to_list()
    replaced = [None if str(x).lower() in ['nan', 'none', 'null'] else x for x in vals_list]
    df[column] = replaced
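Whether the comprehension is actually faster is easy to check on your own data. Below is a rough timing sketch, not a rigorous benchmark: the frame built by `make_frame`, the helper names, and the row count are all made up for illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

NULL_LIKE = ("nan", "none", "null")  # placeholder strings from the question


def make_frame(n_rows=2000):
    # Hypothetical test data: one string column with placeholders, one numeric column.
    return pd.DataFrame({
        "a": ["nan", "x", "NULL", "y"] * (n_rows // 4),
        "b": [1, 2, 3, 4] * (n_rows // 4),
    })


def with_iterrows(df):
    # The nested-loop approach from the question, applied to a copy.
    df = df.copy()
    for column in df.columns:
        for idx, row in df.iterrows():
            if str(row[column]).lower() in NULL_LIKE:
                df.at[idx, column] = None
    return df


def with_comprehension(df):
    # The list-comprehension approach, applied to a copy.
    df = df.copy()
    for column in df.columns:
        df[column] = [None if str(x).lower() in NULL_LIKE else x
                      for x in df[column].to_list()]
    return df


frame = make_frame()
t_loop = timeit.timeit(lambda: with_iterrows(frame), number=1)
t_comp = timeit.timeit(lambda: with_comprehension(frame), number=1)
print(f"iterrows: {t_loop:.3f}s, comprehension: {t_comp:.3f}s")
```

Both helpers work on a copy, so the same input frame can be timed repeatedly.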

A simple approach is to use isin and mask:

import pandas as pd

df = pd.DataFrame([[1,2,'nan'],
                   ['none',3,'NULL']])

# cells matching any of the listed strings are replaced with NaN
df_clean = df.mask(df.isin(["nan", "none", "NULL"]))

Or, if you want to update in place:

df[df.isin(["nan", "none", "NULL"])] = float('nan')

Output:

     0  1    2
0    1  2  NaN
1  NaN  3  NaN
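Note that isin does exact, case-sensitive matching, so with the list above a cell holding "None" or "NaN" would be left alone. If the placeholders can appear in any casing, one possible variation is to lowercase a string view of each column before matching. This is only a sketch: `NULL_LIKE` and `null_like_mask` are names invented here, and the whitespace stripping is an extra assumption about the data.

```python
import pandas as pd

NULL_LIKE = {"nan", "none", "null"}  # assumed set of placeholder strings


def null_like_mask(df: pd.DataFrame) -> pd.DataFrame:
    """Return a boolean frame that is True where a cell looks null-like.

    Each column is viewed as strings, stripped and lowercased, so the
    comparison ignores case and surrounding whitespace.
    """
    return df.apply(
        lambda col: col.astype(str).str.strip().str.lower().isin(NULL_LIKE)
    )


df = pd.DataFrame([[1, 2, "NaN"],
                   ["None", 3, " NULL "]])

# True cells in the mask are replaced with NaN, as in the answer above.
df_clean = df.mask(null_like_mask(df))
print(df_clean)
```

Because the comparison runs on the string form of each cell, values that are already missing (a real None or NaN) are matched as well, which is harmless here.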

A quick and easy optimization:

for column in df.columns:
    for idx, row in df.iterrows():
        col = str(row[column]).lower()
        if (col == "none") or (col == "nan") or (col == "null"):
            df.at[row.name, column] = None

This avoids converting row[column] to a str and lowercasing it (which iterates over every character) three times.

Shorter code:

its_none = ['none', 'nan', 'null']
for column in df.columns:
    for idx, row in df.iterrows():
        if str(row[column]).lower() in its_none:
            df.at[row.name, column] = None

Even shorter and more optimized, assuming the real values you expect are all numbers:

for column in df.columns:
    for idx, row in df.iterrows():
        if str(row[column]).lower().startswith('n'):
            df.at[row.name, column] = None
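The prefix check is only safe under that assumption. A small illustration with made-up values shows how it differs from the explicit membership test once ordinary strings are present:

```python
# Hypothetical cell values: three placeholders, two ordinary strings, one number.
values = ["nan", "None", "NULL", "nine", "New York", 42]

# Prefix test: flags every value whose string form starts with "n" or "N".
by_prefix = [v for v in values if str(v).lower().startswith("n")]

# Membership test: flags only the three known placeholders.
by_membership = [v for v in values if str(v).lower() in ("nan", "none", "null")]

print(by_prefix)      # ['nan', 'None', 'NULL', 'nine', 'New York']
print(by_membership)  # ['nan', 'None', 'NULL']
```

If any column can hold free text, the membership test is the safer of the two.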

If you want to use numpy, you could do this as well (if the values in the fields are truly strings):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name' : ['one', 'two', 'one', 'two'],
    'A' : ['null', 'none', 'empty', 'Keep']
})

df['A'] = np.where(df['A'].isin(['null', 'none', 'empty']), '', df['A'])
df
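The same idea can be stretched to cover every column and to ignore case, writing None instead of an empty string to match the original goal. This is a sketch under assumptions: the `NULL_LIKE` list and the sample data are made up here, and depending on the pandas version and column dtype the replaced cells may be stored as None or as NaN.

```python
import numpy as np
import pandas as pd

NULL_LIKE = ["nan", "none", "null"]  # assumed placeholder strings

df = pd.DataFrame({
    "name": ["one", "two", "one", "two"],
    "A": ["null", "None", "NaN", "Keep"],
})

for column in df.columns:
    # Compare on a lowercased string view so "None" and "NaN" match too.
    lowered = df[column].astype(str).str.lower()
    # Where the cell is null-like take None, otherwise keep the original value.
    df[column] = np.where(lowered.isin(NULL_LIKE), None, df[column])

print(df)
```

Either way, df.isna() reports the replaced cells as missing.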
