Pandas Dataframe - Replace None-like values with None in all columns

Question

I need to clean up a dataframe whose columns come from different sources and have different types. This means that I can have, for example, string columns that contain "nan", "none", or "NULL" as strings instead of actual None values.

My goal is to find all empty values and replace them with None. This works fine:

for column in df.columns:
    for idx, row in df.iterrows():
        if (str(row[column]).lower() == "none") or (str(row[column]).lower() == "nan") or (str(row[column]).lower() == "null"):
            df.at[row.name, column] = None

But it is obviously not the best or fastest way to do it. How can I take advantage of Pandas operations or list comprehensions to do this substitution? Thanks!

Answer 1

This seems to be a somewhat controversial topic (see e.g. this thread), but it's often said that list comprehensions are more computationally efficient than for loops, especially when iterating over pandas dataframes.

I also prefer list comprehensions stylistically, as they lead to fewer levels of indentation than nested loops and if statements.

Here's what it looks like for your use case:

for column in df.columns:
    vals_list = df[column].to_list()
    replaced = [None if str(x).lower() in ['nan', 'none', 'null'] else x for x in vals_list]
    df[column] = replaced
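
The same comprehension can be wrapped in a small function. This is only a sketch: the name replace_none_like and the NONE_LIKE set are my own placeholders, not anything from pandas. It builds each cleaned column as an object-dtype Series, on the assumption that you want real None objects to survive. Assigning a plain list can let pandas convert None to NaN when the remaining values are numeric.

```python
import pandas as pd

# Assumed set of placeholder strings, compared in lowercase.
NONE_LIKE = {"nan", "none", "null"}

def replace_none_like(df):
    """Return a copy of df with none-like values replaced by None."""
    out = df.copy()
    for column in out.columns:
        cleaned = [None if str(x).lower() in NONE_LIKE else x for x in out[column]]
        # dtype=object keeps None as None instead of coercing it to NaN.
        out[column] = pd.Series(cleaned, index=out.index, dtype=object)
    return out

df = pd.DataFrame({"a": [1, "none", 3], "b": ["x", "NULL", "nan"]})
result = replace_none_like(df)
print(result)
```

The trade-off is that every column ends up with object dtype, so numeric columns may need converting back afterwards.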

Answer 2

Simple approach, use isin and mask:

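import pandas as pd

df = pd.DataFrame([[1,2,'nan'],
                   ['none',3,'NULL']])

# isin matches exact, case-sensitive values; mask replaces the matches with NaN
df_clean = df.mask(df.isin(["nan", "none", "NULL"]))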

Or, if you want to update in place:

df[df.isin(["nan", "none", "NULL"])] = float('nan')

Output:

     0  1    2
0    1  2  NaN
1  NaN  3  NaN
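
Since isin compares exact values, "None" or "NaN" with different capitalization would slip through. If that matters, one possible variation (a sketch, with NONE_LIKE as an assumed name for the placeholder set) is to build the mask from a lowercased string view of each column:

```python
import pandas as pd

# Assumed set of placeholder strings, compared in lowercase.
NONE_LIKE = {"nan", "none", "null"}

df = pd.DataFrame([[1, 2, "NaN"],
                   ["None", 3, "NULL"]])

# Stringify and lowercase each column, then test membership.
is_none_like = df.apply(lambda col: col.astype(str).str.lower().isin(NONE_LIKE))

# Replace the flagged cells with NaN, leaving everything else untouched.
df_clean = df.mask(is_none_like)
print(df_clean)
```

Because real NaN values stringify to "nan", they are flagged by this mask as well, which is harmless here since they are already missing.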

Answer 3

A quick and easy optimization:

for column in df.columns:
    for idx, row in df.iterrows():
        # convert and lowercase the cell once, then reuse the result
        value = str(row[column]).lower()
        if (value == "none") or (value == "nan") or (value == "null"):
            df.at[row.name, column] = None

There is no need to convert row[column] to a str and lowercase it character by character three times; do it once and compare the result.

Shorter code:

its_none = ['none', 'nan', 'null']
for column in df.columns:
    for idx, row in df.iterrows():
        if str(row[column]).lower() in its_none:
            df.at[row.name, column] = None

Even shorter and more optimized, assuming every other value is a number (so nothing else starts with "n"):

for column in df.columns:
    for idx, row in df.iterrows():
        if str(row[column]).lower().startswith('n'):
            df.at[row.name, column] = None
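
Keep in mind that the prefix check is broader than the exact comparison. A quick illustration with made-up sample values shows what each test catches:

```python
# Hypothetical sample values, chosen only to illustrate the difference.
values = ["nan", "None", "NULL", "north", "nine", 42]

# Prefix test: anything whose lowercase string form starts with "n".
prefix_match = [v for v in values if str(v).lower().startswith("n")]

# Exact test: only the three known placeholder strings.
exact_match = [v for v in values if str(v).lower() in {"nan", "none", "null"}]

print(prefix_match)  # ['nan', 'None', 'NULL', 'north', 'nine']
print(exact_match)   # ['nan', 'None', 'NULL']
```

So the startswith version is only safe when no legitimate value can begin with "n".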

Answer 4

If you want to use numpy, you could do this as well (provided the values in the fields are truly strings):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name' : ['one', 'two', 'one', 'two'],
    'A' : ['null', 'none', 'empty', 'Keep']
})

df['A'] = np.where(df['A'].isin(['null', 'none', 'empty']), '', df['A'])
df
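
To cover every column at once and get None rather than an empty string, the same idea might be stretched to the whole frame. This is a sketch under the assumption that object dtype is acceptable for the result; np.where returns a plain NumPy array, so the DataFrame has to be rebuilt with the original index and columns. The names placeholders and df_clean are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["one", "two", "one", "two"],
    "A": ["null", "none", "empty", "Keep"],
})

# Assumed list of strings that should count as missing.
placeholders = ["null", "none", "empty"]

# Boolean array marking the cells to blank out.
is_placeholder = df.isin(placeholders).to_numpy()

# Pick None where flagged, otherwise keep the original cell.
values = np.where(is_placeholder, None, df.to_numpy(dtype=object))

# Rebuild the frame with the original labels.
df_clean = pd.DataFrame(values, index=df.index, columns=df.columns)
print(df_clean)
```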
