Replace Pandas DataFrame column values based on substring matches from a list of symbols

Question

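I am trying to remove some bad data from two columns in a DataFrame. The columns are prone to corruption, where symbols appear in the column values. I want to check every value in both columns and, wherever a symbol is present, replace the affected value with ''.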

For example:

import pandas as pd

bad_chars = [')', ',', '@', '/', '!', '&', '*', '.', '_', ' ']


d = {'p1' : [1,2,3,4,5,6],
    'p2' : ['abc*', 'abc@', 'zxya', '&sdf', 'p xx', 'abcd'],
    'p3' : ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}

df = pd.DataFrame(d) 

    p1  p2      p3
0   1   abc*    abc
1   2   abc@    abc.
2   3   zxya    zxya
3   4   &sdf    &sdf
4   5   p xx    p xx
5   6   abcd    abcd

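I have been trying to use a list comprehension to iterate over the bad_chars variable and replace the values in columns p2 and p3 with an empty string '', but without success. The result I am after looks like this: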

    p1  p2      p3
0   1           abc
1   2           
2   3   zxya    zxya
3   4       
4   5       
5   6   abcd    abcd

Once this is done, I would like to drop every row that has an empty cell in column p2, column p3, or both.

    p1  p2      p3
0   3   zxya    zxya
1   6   abcd    abcd

Answer 1

Here you go:

import pandas as pd

# Raw strings, each symbol backslash-escaped so the regex treats it literally
bad_chars = [r'\)', r'\,', r'\@', r'\/', r'\!', r'\&', r'\*', r'\.', r'\_', r'\ ']


d = {'p1' : [1,2,3,4,5,6],
    'p2' : ['abc*', 'abc@', 'zx_ya', '&sdf', 'p xx', 'abcd'],
    'p3' : ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}

df = pd.DataFrame(d)
df.loc[df['p2'].str.contains('|'.join(bad_chars)), 'p2'] = None
df.loc[df['p3'].str.contains('|'.join(bad_chars)), 'p3'] = None
df = df.dropna(subset=['p2', 'p3'])
df

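Note that I have changed bad_chars: each symbol is now prefixed with \ so that str.contains, which treats the pattern as a regular expression, matches it literally.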
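
As an alternative to escaping each symbol by hand, the pattern could be built with re.escape from the original bad_chars list. The following is a minimal sketch, assuming p2 and p3 hold only strings; the names pattern, cols and cleaned are illustrative and not part of the answer above. It follows the two steps described in the question: blank the bad values first, then drop the affected rows.

```python
import re
import pandas as pd

bad_chars = [')', ',', '@', '/', '!', '&', '*', '.', '_', ' ']

d = {'p1': [1, 2, 3, 4, 5, 6],
     'p2': ['abc*', 'abc@', 'zxya', '&sdf', 'p xx', 'abcd'],
     'p3': ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}
df = pd.DataFrame(d)

# re.escape adds the backslashes, so the original list can stay unchanged
pattern = '|'.join(re.escape(ch) for ch in bad_chars)

cols = ['p2', 'p3']  # illustrative name for the columns to check
for col in cols:
    # Step 1: blank every value that contains at least one bad character
    df.loc[df[col].str.contains(pattern, regex=True), col] = ''

# Step 2: keep only the rows where neither column was blanked
cleaned = df[(df[cols] != '').all(axis=1)].reset_index(drop=True)
print(cleaned)
```

Keeping the blanking and the dropping as separate steps makes it possible to inspect the intermediate frame before any rows are discarded.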

Answer 2

Here is another option you could try.

import pandas as pd

bad_chars = [')', ',', '@', '/', '!', '&', '*', '.', '_', ' ']

d = {'p1' : [1,2,3,4,5,6],
    'p2' : ['abc*', 'abc@', 'zxya', '&sdf', 'p xx', 'abcd'],
    'p3' : ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}
df = pd.DataFrame(d)


for i in df.index:
    # Build a True/False list that flags each character of the cell's
    # content found in bad_chars, using a list comprehension
    p2_chks = [char in bad_chars for char in df.at[i,"p2"]]
    p3_chks = [char in bad_chars for char in df.at[i,"p3"]]

    # If True appears in either of the check lists,
    # then delete the row
    if (True in p2_chks) or (True in p3_chks):
        print("{}: p2 or p3 contains a bad character".format(i))
        df = df.drop(i)

# Reindex the df rows. Use drop=True so 
# new column is not added with old index
df = df.reset_index(drop=True)
print(df)
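
The same per-character check can also be written without the explicit loop and the repeated drop calls. This is a minimal sketch, assuming every cell in p2 and p3 is a string; bad_set, is_clean and keep are illustrative names that do not appear in the answer above.

```python
import pandas as pd

bad_chars = [')', ',', '@', '/', '!', '&', '*', '.', '_', ' ']
bad_set = set(bad_chars)  # illustrative name; a set gives fast membership tests

d = {'p1': [1, 2, 3, 4, 5, 6],
     'p2': ['abc*', 'abc@', 'zxya', '&sdf', 'p xx', 'abcd'],
     'p3': ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}
df = pd.DataFrame(d)

def is_clean(text):
    """Return True when text shares no character with bad_set."""
    return bad_set.isdisjoint(text)

# Element-wise check on each column, then require both columns to pass
keep = df['p2'].map(is_clean) & df['p3'].map(is_clean)

# Filter once with the boolean mask and renumber the remaining rows
df = df[keep].reset_index(drop=True)
print(df)
```

Because no regular expression is involved here, the symbols need no escaping.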

Answer 3

Please try this:

import pandas as pd
import numpy as np
bad_chars = [')', ',', '@', '/', '!', '&', '*', '.', '_', ' ']


d = {'p1' : [1,2,3,4,5,6],
    'p2' : ['abc*', 'abc@', 'zxya', '&sdf', 'p xx', 'abcd'],
    'p3' : ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}

df = pd.DataFrame(d)
def check_char(text):
    # Return NaN as soon as any bad character is found in the text
    for char in bad_chars:
        if char in text:
            return np.nan
    return text

check_cols = ['p2','p3']
for col in check_cols:
    df[col] = df[col].apply(check_char)
df = df.dropna(subset=check_cols)
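
One caveat: the expression `char in text` raises a TypeError when a cell is not a string, for example when a value is already NaN. The sketch below is a guarded variant for that case. The function name check_char_safe and the sample data with a missing value are assumptions made for illustration only; they do not come from the question.

```python
import numpy as np
import pandas as pd

bad_chars = [')', ',', '@', '/', '!', '&', '*', '.', '_', ' ']

# Illustrative data with one pre-existing missing value (an assumption)
df = pd.DataFrame({'p1': [1, 2, 3, 4],
                   'p2': ['abc*', np.nan, 'zxya', 'abcd'],
                   'p3': ['abc', 'abc', 'zx.ya', 'abcd']})

def check_char_safe(value):
    # Non-string cells (such as NaN) are treated as missing
    if not isinstance(value, str):
        return np.nan
    # Any bad character marks the whole value as missing
    if any(char in value for char in bad_chars):
        return np.nan
    return value

check_cols = ['p2', 'p3']
for col in check_cols:
    df[col] = df[col].apply(check_char_safe)

# Drop rows where either checked column is missing
df = df.dropna(subset=check_cols).reset_index(drop=True)
print(df)
```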

3 answers

Answer 1
1 accepted 2022-03-08 13:28:09

Answer 2
0 2022-03-08 14:04:04

Answer 3
0 2022-03-08 14:05:33
