
To check missing values in a csv file using Pandas

I have a large csv file, but for simplicity I have removed many rows and columns. It looks like below:

col1        col2  col3
?           27    13000
?           27    13000
validvalue  30
#           26    14000
validvalue  25

I want to detect missing values in this csv file. For example: missing values in col1 are indicated by ? and #, and in col3 by empty cells. Things would have been easier if the data set had empty cells for all missing values; in that case I could have used the isnull function of the pandas DataFrame. But the question is how to identify columns that use something other than empty cells to mark missing values.
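For reference, if every missing value really were an empty cell, a per-column null count would already be enough. A minimal sketch, assuming the file is named test.csv and that empty cells are parsed as NaN (read_csv's default behaviour):

import pandas as pd

df = pd.read_csv('test.csv')
print(df.isnull().sum())  # number of missing (NaN) cells per column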

Approach if the csv has a low number of records

import pandas as pd

df = pd.read_csv('test.csv')
for e in df.columns:
    print(df[e].unique())  # print every distinct value in the column

This will give us all unique values in that particular column, but I don't find it efficient.

Is there any other way to detect missing values which are denoted by special characters such as ?, #, * etc. in the csv file?

As you already stated:

there is no way to find the garbage values other than using the "unique" function.

But if the number of possible values is big you can help yourself by using .isalnum() to limit the check to non-alphanumeric strings. For example:

df = pd.DataFrame({"col1": ['?', '?', 'validvalue', '$', 'validvalue'],
                   "col2": [27, 27, 30, 26, 25],
                   "col3": [13000, 13000, None, 14000, None]})

df[~df['col1'].str.isalnum()]['col1'].value_counts()

#Output:
#?    2
#$    1
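Using the same example frame, the check can also be run over every text column at once. A sketch only; the suspicious variable name is illustrative:

import pandas as pd

df = pd.DataFrame({"col1": ['?', '?', 'validvalue', '$', 'validvalue'],
                   "col2": [27, 27, 30, 26, 25],
                   "col3": [13000, 13000, None, 14000, None]})

# Scan every text (object) column and count values that are not purely
# alphanumeric -- candidates for garbage/missing-value markers.
for col in df.select_dtypes(include='object').columns:
    suspicious = df.loc[df[col].str.isalnum() == False, col]  # NaN compares False here, so real NaNs drop out
    if not suspicious.empty:
        print(col)
        print(suspicious.value_counts())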

When you have found all possible NA values, you can use mask on each column (if the missing-value markers differ from column to column) or on the whole dataset, for example:

na_values = ('?', '#')
df.mask(df.isin(na_values))
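As a follow-up, once the marker characters are known they can also be handled at load time: read_csv's na_values parameter converts them to NaN while reading, so the usual isna tooling applies afterwards. A sketch, assuming the file is test.csv and the markers found above:

import pandas as pd

df = pd.read_csv('test.csv', na_values=['?', '#'])
print(df.isna().sum())  # per-column missing counts, now including ? and #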
