
Conditionally replace values in pandas.DataFrame with previous value

I need to filter outliers in a dataset. Replacing the outlier with the previous value in the column makes the most sense in my application.

I was having considerable difficulty doing this with the pandas tools available (mostly to do with copies on slices, or type conversions occurring when setting to NaN).

Is there a fast and/or memory-efficient way to do this? (Please see my answer below for the solution I am currently using, which also has limitations.)


A simple example:

>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,1000,6,7,8],'B':list('abcdefgh')})
>>> df
      A  B
0     1  a
1     2  b
2     3  c
3     4  d
4  1000  e # '1000  e' --> '4  e'
5     6  f
6     7  g
7     8  h

You can simply mask values over your threshold and use ffill:

df.assign(A=df.A.mask(df.A.gt(10)).ffill())

     A  B
0  1.0  a
1  2.0  b
2  3.0  c
3  4.0  d
4  4.0  e
5  6.0  f
6  7.0  g
7  8.0  h

Using mask rather than something like shift is necessary, because it guarantees non-outlier output even when the previous value is also above the threshold.
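To illustrate the point about consecutive outliers, here is a minimal sketch comparing the two approaches. The data (two adjacent outliers) and the variable names `is_outlier`, `shifted`, and `filled` are made up for this example and are not from the question.

```python
import pandas as pd

# Hypothetical data with two adjacent outliers (rows 3 and 4).
df = pd.DataFrame({'A': [1, 2, 3, 1000, 2000, 6], 'B': list('abcdef')})
is_outlier = df['A'].gt(10)

# shift-based replacement: each outlier takes the value one row above,
# so row 4 receives 1000, which is itself an outlier.
shifted = df['A'].where(~is_outlier, df['A'].shift())

# mask + ffill: every outlier in the run receives the last good value (3).
filled = df['A'].mask(is_outlier).ffill()

print(shifted.tolist())  # [1.0, 2.0, 3.0, 3.0, 1000.0, 6.0]
print(filled.tolist())   # [1.0, 2.0, 3.0, 3.0, 3.0, 6.0]
```

Note that if the very first row is an outlier, ffill has nothing to fill from and leaves NaN there.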

I circumvented some of the issues with pandas copies and slices by converting to a numpy array first, doing the operations there, and then re-inserting the column. I'm not certain, but as far as I can tell, the datatype is the same once it is put back into the pandas.DataFrame.

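import numpy as np

def df_replace_with_previous(df, col, maskfunc, inplace=False):
    arr = np.array(df[col])                      # copy of the column as a NumPy array
    mask = np.array(maskfunc(arr), dtype=bool)   # True where the value is an outlier
    mask[:1] = False                             # the first row has no previous value
    # Shift the mask one position earlier so each True points at the previous entry.
    arr[mask] = arr[np.roll(mask, -1)]
    if inplace:
        df[col] = arr
        return
    else:
        df2 = df.copy()
        df2[col] = arr
        return df2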

This creates a mask, shifts it one position earlier so that the True values point at the previous entry, and updates the array. Of course, this needs to be run repeatedly if there are multiple adjacent outliers (N times if there are N consecutive outliers), which is not ideal.
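One possible way to avoid the repeated passes while staying in NumPy is to carry forward the index of the last non-outlier row with a running maximum. This is only a sketch, not part of the original answer; the helper name `replace_with_last_valid` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def replace_with_last_valid(values, mask):
    """Replace entries where mask is True with the nearest earlier
    entry where mask is False (hypothetical helper for illustration)."""
    values = np.asarray(values)
    mask = np.asarray(mask, dtype=bool)
    # Own index for good rows, 0 for outlier rows.
    idx = np.where(~mask, np.arange(len(values)), 0)
    # The running maximum carries the index of the last good row forward.
    # Outliers before the first good row fall back to index 0.
    idx = np.maximum.accumulate(idx)
    return values[idx]

# Three consecutive outliers handled in a single pass.
df = pd.DataFrame({'A': [1, 2, 1000, 2000, 3000, 6], 'B': list('abcdef')})
a = df['A'].to_numpy()
df['A'] = replace_with_last_valid(a, a > 10)
print(df['A'].tolist())  # [1, 2, 2, 2, 2, 6]
```

Because no NaN is ever introduced, the column keeps its integer dtype. Outliers at the very start of the column have no earlier good value and are filled with the first element as-is.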

Usage for the case given in the question:

df_replace_with_previous(df,'A',lambda x:x>10,False)
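The uncertainty about the datatype can be checked directly. The sketch below, using the question's example data, compares the dtype after a NumPy round trip with the dtype after masking to NaN; the names `via_numpy`, `via_mask`, and `restored` are introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 1000, 6, 7, 8], 'B': list('abcdefgh')})
original_dtype = df['A'].dtype

# Round trip through NumPy: the array keeps the column's integer dtype.
arr = df['A'].to_numpy().copy()
arr[4] = arr[3]
via_numpy = df.assign(A=arr)

# Masking introduces NaN, so the column is upcast to float64.
via_mask = df.assign(A=df['A'].mask(df['A'].gt(10)).ffill())

# Casting back works once no NaN remains (assumes row 0 is not an outlier).
restored = via_mask['A'].astype(original_dtype)

print(via_numpy['A'].dtype == original_dtype)  # True
print(via_mask['A'].dtype)                     # float64
print(restored.dtype == original_dtype)        # True
```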
