How can I find the missing values in my data frame, and what is the best method for handling them?

Assume the following data frame:

"?" == missing value

How can I find "?" in this data frame with Python, and what is the best method for handling these missing values?

col1   col2     col3    col4    col5   target
?       1         ?       1      20     0
90      1         47      0      40     1
75      ?        246      ?      15     0
60      1        315      1      60     0
78      0        224      0      50     0
48      1         ?       1       ?     1
65      1        135      0      35     1
73      0        582      0      35     1
70      0        1202     0      50     1
54      1        427      0      70     1
68      1        1021     1      35     0
55      0         ?       1      35     1

What do you mean by "finding"?

For example, you could get the null values like this:

df.loc[df['col1'].isnull(), 'col1']
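
If "finding" means locating the literal "?" cells before any conversion, a minimal sketch could look like the following. It assumes the table was loaded as text, so "?" is still a string; the variable names (mask, per_column, rows_with_missing, locations) are only illustrative.

```python
import io
import pandas as pd

# The table from the question, read as plain strings so "?" is kept as-is.
raw = """col1 col2 col3 col4 col5 target
? 1 ? 1 20 0
90 1 47 0 40 1
75 ? 246 ? 15 0
60 1 315 1 60 0
78 0 224 0 50 0
48 1 ? 1 ? 1
65 1 135 0 35 1
73 0 582 0 35 1
70 0 1202 0 50 1
54 1 427 0 70 1
68 1 1021 1 35 0
55 0 ? 1 35 1
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", dtype=str)

mask = df.eq("?")                          # True wherever a cell holds "?"
per_column = mask.sum()                    # number of "?" cells in each column
rows_with_missing = df[mask.any(axis=1)]   # rows containing at least one "?"

stacked = mask.stack()                     # one entry per (row, column) pair
locations = stacked[stacked].index.tolist()  # the (row, column) pairs that hold "?"

print(per_column)
print(rows_with_missing)
print(locations)
```

Once the "?" cells have been replaced by NaN, the same pattern works with df.isna() in place of df.eq("?").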

There is no "best method" to fill in missing values; it depends on the nature of your data. For example, you can fill in missing values with zeros like this:

df = df.fillna(0)

Assuming your data frame (more details would be helpful) is a Pandas data frame named df:

import numpy as np
df = df.replace('?', np.nan)

This replaces every ? with numpy's NaN, which is then handled accordingly in calculations and/or plotting.

If you read the data with pd.read_csv(), you can use the na_values option directly:

import pandas as pd 
df = pd.read_csv('datafile.csv', na_values = "?")
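
To illustrate what na_values changes, here is a small sketch that parses the same text twice. The CSV content is a hypothetical stand-in for datafile.csv (the first rows of the table in the question), fed through io.StringIO so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'datafile.csv': the first rows of the question's table.
csv_text = (
    "col1,col2,col3,col4,col5,target\n"
    "?,1,?,1,20,0\n"
    "90,1,47,0,40,1\n"
    "75,?,246,?,15,0\n"
)

as_text = pd.read_csv(io.StringIO(csv_text))                 # "?" stays a string
as_nan = pd.read_csv(io.StringIO(csv_text), na_values="?")   # "?" is parsed as NaN

print(as_text.dtypes)        # columns containing "?" are read as text, not numbers
print(as_nan.dtypes)         # the same columns are now numeric (float64)
print(as_nan.isna().sum())   # missing values per column
```

Reading the file this way avoids the separate replace step and gives numeric columns from the start.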

There are multiple ways to handle missing data. Here are some of them:

  • Remove rows with missing data
  • Remove rows for specific variables
  • Drop variables with missing data
  • Impute missing data with fixed values (like 0, -1, etc.)
  • Impute missing data with central tendencies (like mean, median, etc.)
  • Interpolate row-sequence or column-sequence data
  • Predict missing data using some ML model
  • Denoising methods

Depending on the amount and type of missing data, and on the end task you have for the dataset, each of these methods has its place.
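
As a rough sketch, the simpler options in the list above map onto pandas calls like these. The frame is a small illustrative subset of the question's values with "?" already written as NaN; the model-based and denoising options are left out because they depend heavily on the data.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: a few of the question's values, "?" already replaced by NaN.
df = pd.DataFrame({
    "col1": [np.nan, 90, 75, 60, 78, 48],
    "col2": [1, 1, np.nan, 1, 0, 1],
    "col3": [np.nan, 47, 246, 315, 224, np.nan],
    "target": [0, 1, 0, 0, 0, 1],
})

drop_rows = df.dropna()                       # remove rows with any missing value
drop_rows_col1 = df.dropna(subset=["col1"])   # remove rows only where col1 is missing
drop_cols = df.dropna(axis=1)                 # drop variables (columns) with missing data
fixed = df.fillna(0)                          # impute with a fixed value
median = df.fillna(df.median())               # impute each column with its own median
interpolated = df.interpolate()               # linear interpolation down each column

print(drop_rows)
print(median)
```

Interpolation only makes sense when the row order carries meaning (for example a time series), which may not be the case for this table.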

You can read about this in more detail, with code examples, in my blog post Handling missing data (like a boss!)

A library I usually recommend is called missingno. It lets you analyze your missing data visually as well as with nullity correlations (existential correlation between variables). GitHub repo and documentation here.


Detailed guide

1. Get to NaNs and not-NaNs

The first step would be to analyze the missing data by changing the ? to NaN values and getting a sense of how much of it exists. You can use missingno.matrix and missingno.bar to visualize this.

#!pip install missingno
import missingno as msno 
import numpy as np


df_raw = df.replace('?', np.nan)
print(df_raw)

#    col1 col2  col3 col4 col5  target
# 0   NaN    1   NaN    1   20       0
# 1    90    1    47    0   40       1
# 2    75  NaN   246  NaN   15       0
# 3    60    1   315    1   60       0
# 4    78    0   224    0   50       0
# 5    48    1   NaN    1  NaN       1
# 6    65    1   135    0   35       1
# 7    73    0   582    0   35       1
# 8    70    0  1202    0   50       1
# 9    54    1   427    0   70       1
# 10   68    1  1021    1   35       0
# 11   55    0   NaN    1   35       1
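
Besides the plots, plain pandas can put numbers on "how much of it exists". The sketch below rebuilds the same values as df_raw by hand so it runs on its own; the names missing_per_column, missing_share and missing_per_row are illustrative.

```python
import numpy as np
import pandas as pd

# Same values as df_raw above, typed in directly with NaN in place of "?".
df_raw = pd.DataFrame({
    "col1": [np.nan, 90, 75, 60, 78, 48, 65, 73, 70, 54, 68, 55],
    "col2": [1, 1, np.nan, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    "col3": [np.nan, 47, 246, 315, 224, np.nan, 135, 582, 1202, 427, 1021, np.nan],
    "col4": [1, 0, np.nan, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    "col5": [20, 40, 15, 60, 50, np.nan, 35, 35, 50, 70, 35, 35],
    "target": [0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
})

missing_per_column = df_raw.isna().sum()      # count of NaNs in each column
missing_share = df_raw.isna().mean()          # fraction of NaNs in each column
missing_per_row = df_raw.isna().sum(axis=1)   # count of NaNs in each row

print(missing_per_column)
print(missing_share)
print(missing_per_row[missing_per_row > 0])
```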

2. Row-wise analysis

You can visually represent the missing values as white and the existing values as black. As you can see below, a few of your rows have too many missing values (only 4 values available). Based on this, you can decide whether to keep or drop these rows.

msno.matrix(df_raw)

[image: output of msno.matrix(df_raw)]
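
If you decide to drop the sparse rows, one possible way is dropna with a thresh argument, which keeps only rows that have at least that many non-missing values. The cut-off of 5 below is an assumption chosen for illustration, not a recommendation.

```python
import io
import pandas as pd

# The question's table, parsed with "?" treated as NaN.
raw = """col1 col2 col3 col4 col5 target
? 1 ? 1 20 0
90 1 47 0 40 1
75 ? 246 ? 15 0
60 1 315 1 60 0
78 0 224 0 50 0
48 1 ? 1 ? 1
65 1 135 0 35 1
73 0 582 0 35 1
70 0 1202 0 50 1
54 1 427 0 70 1
68 1 1021 1 35 0
55 0 ? 1 35 1
"""
df_raw = pd.read_csv(io.StringIO(raw), sep=r"\s+", na_values="?")

min_values = 5                                 # assumption: require 5 of the 6 values
kept = df_raw.dropna(thresh=min_values)        # rows with at least 5 non-missing values
dropped = df_raw.index.difference(kept.index)  # labels of the rows that were removed

print(list(dropped))
print(kept)
```

With this cut-off, the rows that have only 4 values available are removed, while the row with a single missing value is kept.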

3. Column-wise analysis

A bar chart shows the total number of missing values per column. Your col3 has the highest number of missing values. If you replace these values with, say, a mean, you have to ask yourself whether the mean of the remaining values is good enough to represent them!

msno.bar(df_raw)

[image: output of msno.bar(df_raw)]

From here on, you have to decide the purpose of each row/column before deciding how to handle the missing data. A quick way would be to simply set each missing value to 0. But each column in your case represents something different. Some are binary while others are continuous.

You may want to handle each kind of column separately.

If you want to fill the missing values of a selected column with its mean, you can simply do:

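column = 'col1'  # change column name
# df_raw holds the NaNs; cast to float in case the column still contains strings
df_raw[column] = df_raw[column].astype(float)
df_raw[column] = df_raw[column].fillna(df_raw[column].mean())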

Modifying the above code will help you handle each column separately (it works for rows as well, using df.iloc[]).
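
As one possible way to treat binary and continuous columns differently, the sketch below fills binary columns with their most frequent value and continuous columns with their median. Which columns count as binary or continuous is an assumption made from looking at the values, and the frame is a small illustrative subset of the question's data.

```python
import numpy as np
import pandas as pd

# Small illustrative subset of the question's data, "?" already replaced by NaN.
df_raw = pd.DataFrame({
    "col1": [np.nan, 90, 75, 60, 78, 48],
    "col2": [1, 1, np.nan, 1, 0, 1],
    "col3": [np.nan, 47, 246, 315, 224, np.nan],
    "col4": [1, 0, np.nan, 1, 0, 1],
})

binary_cols = ["col2", "col4"]       # assumed to be 0/1 flags
continuous_cols = ["col1", "col3"]   # assumed to be continuous measurements

filled = df_raw.copy()
for col in binary_cols:
    # Most frequent value keeps the column strictly 0/1.
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
for col in continuous_cols:
    # Median is less sensitive to outliers than the mean.
    filled[col] = filled[col].fillna(filled[col].median())

print(filled)
```

Using the mean on a 0/1 column would produce fractional values, which is why the two groups are handled in separate loops here.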
