
Fastest way to get the percentage of rows with incorrect values for each feature in a Pandas DataFrame

Below is the code I have. It seems to work for ?, ' ' and '' but not for np.NaN. Any suggestions?

Also, I am new to Pandas/Python, so I would like to know if there is a faster way to do this.

I am thinking of treating a feature as suspect if more than X% (say 5%) of its rows have missing values. Are there any other initial data sanitization checks that you regularly use?

for col in df.columns:
  pcnt_missing = df[df[col].isin(['?','',' ',np.NaN])][col].count() * 100.0 / df[col].count()
  if pcnt_missing > 1:
    print(f"Col = {col}, Percent missing ={pcnt_missing:.2f}")

If you can replace the values ?, '' and ' ' with np.nan, you can easily compute the percentage of missing values using the sum of the missing-value flags and the length of the DataFrame. You can replace the missing values with an apply:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1,2,3,4], 'b': [2, '', '?', 4], 'c': [' ', np.nan, '', 5]})

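def replace(x):
    # Flag the placeholder strings that should count as missing
    idx = x.isin(['', ' ', '?'])
    # mask() returns a copy with the flagged entries set to NaN,
    # so the original DataFrame is not modified in place
    return x.mask(idx)

replaced = df.apply(replace, axis=1)  # Values are replaced here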

Now you can compute the percentage of missing values for each column with this:

replaced.isna().sum(axis=0) * 100 / len(replaced)

Output:

a     0.0
b    50.0
c    75.0
dtype: float64
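As a side note on why the code in the question misses np.NaN: count() only counts non-null entries, so NaN rows drop out of both the numerator and the denominator. The small sketch below illustrates this on a single column taken from the example data; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([' ', np.nan, '', 5])

# count() skips NaN, while len() counts every row
non_null = s.count()
total = len(s)

# Selecting only the NaN rows and then calling count() always gives 0
nan_rows = s[s.isna()].count()

# Summing the boolean isna() flags and dividing by len() does count NaN rows
pct_nan = s.isna().sum() * 100 / len(s)

print(non_null, total, nan_rows, pct_nan)
```

This is why the snippet above divides by len(replaced) and sums isna() rather than relying on count().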

Use boolean logic with isna, using @Ricardo Erikson's setup:

df = pd.DataFrame({'a': [1,2,3,4], 'b': [2, '', '?', 4], 'c': [' ', np.nan, '', 5]})

(df.isna() | df.isin(['?','',' '])).mean()

Output:

a    0.00
b    0.50
c    0.75
dtype: float64

Check for NaN with isna, combine that with isin using | (the boolean OR operator), and then use mean to get the fraction of missing values in each column. Multiply by 100 if you want a percentage.
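To tie this back to the 5% rule of thumb from the question, here is a minimal sketch that wraps the boolean approach in a helper and returns only the columns above a threshold. The function name suspect_columns and its parameters placeholders and threshold_pct are hypothetical, chosen for illustration only.

```python
import numpy as np
import pandas as pd

def suspect_columns(df, placeholders=('?', '', ' '), threshold_pct=5.0):
    """Return the percent of missing values for columns above threshold_pct.

    Hypothetical helper: a value counts as missing if it is NaN
    or equal to one of the placeholder strings.
    """
    missing = df.isna() | df.isin(list(placeholders))
    pct = missing.mean() * 100
    # Keep only columns over the threshold, worst first
    return pct[pct > threshold_pct].sort_values(ascending=False)

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [2, '', '?', 4],
                   'c': [' ', np.nan, '', 5]})

flagged = suspect_columns(df)
print(flagged)
```

With the example data this should flag columns c and b and leave a out. The set of placeholder strings is an assumption and would need to be adjusted to whatever markers your data actually uses.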
