How can I find the missing values in my data frame, and what is the best method for handling them?
Assume the following data frame:
"?" == missing value
How can I find the "?" entries in this data frame with Python, and what is the best way to handle these missing values?
col1 col2 col3 col4 col5 target
? 1 ? 1 20 0
90 1 47 0 40 1
75 ? 246 ? 15 0
60 1 315 1 60 0
78 0 224 0 50 0
48 1 ? 1 ? 1
65 1 135 0 35 1
73 0 582 0 35 1
70 0 1202 0 50 1
54 1 427 0 70 1
68 1 1021 1 35 0
55 0 ? 1 35 1
What do you mean by "finding"?
For example, you could get the null values like this:
df.loc[df['col1'].isnull(), 'col1']
There is no "best method" to fill in missing values; it depends on the nature of your data. For example, you can fill in missing values with zeros like this:
df = df.fillna(0)
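Note that until the "?" markers are converted, they are ordinary strings, so isnull() will not report them. A minimal sketch of locating them directly, assuming the columns were loaded as strings (the small frame reuses the first four rows of col1 to col3 from the question; the names is_missing, per_column and rows_with_missing are illustrative only):

```python
import pandas as pd

# First four rows of col1..col3 from the question, kept as strings.
df = pd.DataFrame({
    "col1": ["?", "90", "75", "60"],
    "col2": ["1", "1", "?", "1"],
    "col3": ["?", "47", "246", "315"],
})

# Boolean mask: True wherever a cell holds the "?" marker.
is_missing = df.eq("?")

# Number of "?" entries in each column.
per_column = is_missing.sum()

# Rows that contain at least one "?".
rows_with_missing = df[is_missing.any(axis=1)]

print(per_column)
print(rows_with_missing)
```

The same mask can be reused later, for example to check which cells were imputed.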
Assuming your data frame (more details would be helpful) is a Pandas data frame named df:
import numpy as np
df = df.replace('?', np.nan)  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
This replaces all ? with NumPy's NaN, which is then handled accordingly in calculations and/or plotting.
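One caveat worth checking: if the columns were read as strings because of the "?" markers, they may still have a non-numeric dtype after the replacement. A small sketch of the conversion step, assuming string-typed input (the two-column frame is a cut-down stand-in for the question's data):

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for the question's data, loaded as strings.
df = pd.DataFrame({
    "col1": ["?", "90", "75"],
    "col5": ["20", "40", "15"],
})

# Step 1: turn the "?" markers into real missing values.
df = df.replace("?", np.nan)

# Step 2: convert every column to a numeric dtype so that
# mean(), fillna() and plotting treat the values as numbers.
df = df.apply(pd.to_numeric)

print(df.dtypes)
print(df.isna().sum())
```

After this, df['col1'].isnull() finds the formerly "?" cells and numeric aggregations skip them by default.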
If you read the data with pd.read_csv(), you can directly use the na_values option:
import pandas as pd
df = pd.read_csv('datafile.csv', na_values = "?")
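To try this without a file on disk, here is a self-contained sketch that feeds the first rows of the question's table to pd.read_csv() through io.StringIO. The comma-separated layout is an assumption; the real file may use another delimiter.

```python
import io
import pandas as pd

# First rows of the question's table, assumed to be comma-separated.
csv_text = """col1,col2,col3,col4,col5,target
?,1,?,1,20,0
90,1,47,0,40,1
75,?,246,?,15,0
"""

# na_values="?" makes the parser treat "?" as missing while reading,
# so the affected columns come out numeric with NaN in those cells.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")

missing_per_column = df.isna().sum()
print(df)
print(missing_per_column)
```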
There are multiple ways to handle missing data. Here are some of them -
Depending on the amount and type of data, and on the end task you have with the dataset, each of these methods has its own place.
You can read about this in more detail, with code examples, in my blog post Handling missing data (like a boss!)
A library I usually recommend is called missingno. This lets you analyze your missing data visually as well as with nullity correlations (existential correlation between variables). GitHub repo and documentation here.
The first step would be to analyze the missing data by changing the ? to NaN values, and getting a sense of how much of it exists. You can use missingno.matrix and missingno.bar to visualize this.
#!pip install missingno
import missingno as msno
import numpy as np
df_raw = df.replace('?', np.nan)
print(df_raw)
# col1 col2 col3 col4 col5 target
# 0 NaN 1 NaN 1 20 0
# 1 90 1 47 0 40 1
# 2 75 NaN 246 NaN 15 0
# 3 60 1 315 1 60 0
# 4 78 0 224 0 50 0
# 5 48 1 NaN 1 NaN 1
# 6 65 1 135 0 35 1
# 7 73 0 582 0 35 1
# 8 70 0 1202 0 50 1
# 9 54 1 427 0 70 1
# 10 68 1 1021 1 35 0
# 11 55 0 NaN 1 35 1
You can visually represent the missing values as white and existing values as black. As you can see below, a few of your rows have too many missing values (only 4 values available). Based on this, you can decide whether to keep or drop these rows.
msno.matrix(df_raw)
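If you decide to drop the sparse rows, one possible way is dropna() with a thresh argument. The sketch below rebuilds df_raw from the question's table so it runs on its own; the cut-off of 5 available values is an assumption chosen to match the "only 4 values available" rows, not a general rule.

```python
import numpy as np
import pandas as pd

nan = np.nan
# The question's table with "?" already replaced by NaN.
df_raw = pd.DataFrame({
    "col1": [nan, 90, 75, 60, 78, 48, 65, 73, 70, 54, 68, 55],
    "col2": [1, 1, nan, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    "col3": [nan, 47, 246, 315, 224, nan, 135, 582, 1202, 427, 1021, nan],
    "col4": [1, 0, nan, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    "col5": [20, 40, 15, 60, 50, nan, 35, 35, 50, 70, 35, 35],
    "target": [0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
})

# Number of available (non-missing) values in each row.
available = df_raw.notna().sum(axis=1)

# Keep only rows with at least 5 available values (assumed cut-off).
df_kept = df_raw.dropna(thresh=5)

print(available)
print(df_kept.index.tolist())
```

With only 12 rows, dropping three of them removes a quarter of the data, so it is worth weighing this against imputing the gaps instead.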
Total number of missing values column-wise with a bar chart. Your col3 has the highest number of missing values. If you replace these values with, say, a mean, you have to ask yourself whether the mean of the remaining values is sufficient to represent those values!
msno.bar(df_raw)
From here on, you have to decide the purpose of each row/column before deciding how to handle the missing data. A quick way would be to simply set each missing value to 0. But each column in your case represents something different. Some are binary while others are continuous.
You may want to handle that fact independently.
If you want to replace the missing values in a selected column with that column's mean, you can simply do -
column = 'col1' #change column name
df[column] = df[column].fillna(df[column].mean())
Modifying the above code will help you handle each column separately (this works for rows as well, using df.iloc[]).
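As one illustration of handling binary and continuous columns differently, the sketch below fills binary columns with their most frequent value and the remaining columns with their median. Both choices are assumptions made for the example rather than a recommendation for this dataset, and the names binary_cols and filled are hypothetical.

```python
import numpy as np
import pandas as pd

nan = np.nan
# The question's table with "?" already replaced by NaN.
df = pd.DataFrame({
    "col1": [nan, 90, 75, 60, 78, 48, 65, 73, 70, 54, 68, 55],
    "col2": [1, 1, nan, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    "col3": [nan, 47, 246, 315, 224, nan, 135, 582, 1202, 427, 1021, nan],
    "col4": [1, 0, nan, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    "col5": [20, 40, 15, 60, 50, nan, 35, 35, 50, 70, 35, 35],
    "target": [0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
})

# Treat a column as binary if its observed values are only 0 and 1.
binary_cols = [c for c in df.columns
               if set(df[c].dropna().unique()) <= {0, 1}]

filled = df.copy()
for col in filled.columns:
    if col in binary_cols:
        # Binary column: fill with the most frequent value (mode).
        fill_value = filled[col].mode().iloc[0]
    else:
        # Continuous column: fill with the median, which is less
        # sensitive to outliers than the mean.
        fill_value = filled[col].median()
    filled[col] = filled[col].fillna(fill_value)

print(binary_cols)
print(filled)
```

If this data feeds a model, the fill values would normally be computed on the training split only and then reused for the test split.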