
How do you deal with missing data using numpy/scipy?

One of the things I deal with most in data cleaning is missing values. R deals with this well using its "NA" missing data label. In Python, it appears that I'll have to deal with masked arrays, which seem to be a major pain to set up and don't seem to be well documented. Any suggestions on making this process easier in Python? This is becoming a deal-breaker in moving into Python for data analysis. Thanks

Update: It's obviously been a while since I've looked at the methods in the numpy.ma module. It appears that at least the basic analysis functions are available for masked arrays, and the examples provided helped me understand how to create masked arrays (thanks to the authors). I would like to see whether some of the newer statistical methods in Python (being developed in this year's GSoC) incorporate this aspect and at least do complete case analysis.

If you are willing to consider a library, pandas (http://pandas.pydata.org/) is a library built on top of numpy which, amongst many other things, provides:

Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form

I've been using it for almost a year in the financial industry, where missing and badly aligned data is the norm, and it has really made my life easier.
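
As a rough illustration of what that looks like in practice, here is a minimal sketch. The series names and values are invented for the example. It shows label-based alignment producing NaN where labels do not match, followed by the usual ways of counting, dropping, or filling the gaps.

```python
import numpy as np
import pandas as pd

# Two hypothetical series whose labels only partly overlap
prices = pd.Series([10.0, 11.0, 12.0], index=["a", "b", "c"])
volumes = pd.Series([100.0, 300.0, 400.0], index=["a", "c", "d"])

# Arithmetic aligns on labels; a label missing on either side yields NaN
turnover = prices * volumes

# Combine into a frame; absent entries are stored as NaN
frame = pd.DataFrame({"price": prices, "volume": volumes})

n_missing = frame.isna().sum()        # count of missing values per column
complete = frame.dropna()             # complete-case rows only
filled = frame.fillna(frame.mean())   # replace NaN with the column mean
mean_price = frame["price"].mean()    # reductions skip NaN by default
```

Here dropna() corresponds to a complete case analysis, while fillna() is a simple form of imputation; which one is appropriate depends on the data.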

I also question whether masked arrays are really a problem. Here are a couple of examples:

import numpy as np
data = np.ma.masked_array(np.arange(10))
data[5] = np.ma.masked # Mask a specific value

data[data>6] = np.ma.masked # Mask any value greater than 6

# Same thing done at initialization time
init_data = np.arange(10)
data = np.ma.masked_array(init_data, mask=(init_data > 6))
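
To round this out, here is a small sketch of how a masked array behaves once it exists. It assumes the missing entries arrive as NaN in a float array (the sample values are invented). np.ma.masked_invalid builds the mask, reductions skip the masked entries, and filled/compressed convert back to plain arrays.

```python
import numpy as np

# Hypothetical measurements with NaN marking missing entries
raw = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])

# Mask every NaN/inf entry in one call
data = np.ma.masked_invalid(raw)

n_valid = data.count()          # number of unmasked entries
mean = data.mean()              # reductions ignore masked entries
total = data.sum()

filled = data.filled(0.0)       # plain ndarray, masked slots replaced by 0.0
valid_only = data.compressed()  # plain 1-D ndarray of the unmasked values

print(n_valid, mean, total)
print(filled)
print(valid_only)
```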

Masked arrays are the answer, as DpplerShift describes. For quick-and-dirty use, you can use fancy indexing with boolean arrays:

>>> import numpy as np
>>> data = np.arange(10)
>>> valid_idx = data % 2 == 0  # pretend that odd elements are missing

>>> # Get non-missing data
>>> data[valid_idx]
array([0, 2, 4, 6, 8])

You can now use valid_idx as a quick mask on other data as well:

>>> comparison = np.arange(10) + 10
>>> comparison[valid_idx]
array([10, 12, 14, 16, 18])

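See sklearn.preprocessing.Imputer. Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; on current versions the equivalent class is sklearn.impute.SimpleImputer, which the code below uses.

import numpy as np
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer
# Replace each NaN with the mean of its column, as learned from the fit data
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))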

Example from http://scikit-learn.org/
