How to reject a window containing an outlier with a condition during rolling average using python?

How can I reject a window of 10 rows when one or more of its rows is an outlier, while computing a rolling average with pandas? I need help with the conditional logic for the scenarios described below.

The conditions on outliers within a window are:

  • The upper bound for a valid value is 15 and the lower bound is 0; anything outside that range is an outlier.

  • If the frequency of outliers in a window is greater than 10%, we reject that window and move on to the next one.

  • If the frequency of outliers in a window is less than 10%, we accept the window with one change: replace each outlier with the average of the non-outlier values (i.e. the remaining 9 rows), then average the same window again before moving on.

Here is my code so far:

_filter = lambda x: float("inf") if x > 15 or x < 0 else x

# Apply the rolling mean over the filtered values (outliers replaced by inf)
result = df_list["speed"].apply(_filter).rolling(10).mean().dropna()

# Print the max rolling average
print("The max rolling average is:")
print(result.max())

Use rolling with a custom aggregation function:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(100), "speed": np.random.randint(0, 17, 100)})

MAX = 15
MIN = 0

def my_mean(s):
    # Count the values in the window that fall outside [MIN, MAX]
    outlier_count = ((s < MIN) | (s > MAX)).sum()
    if outlier_count > 2:  # defined 2 as the threshold - can put any other number here
        return np.nan  # too many outliers: reject this window
    # Average only the non-outlier values
    return s[(s <= MAX) & (s >= MIN)].mean()

df["roll"] = df.speed.rolling(10).apply(my_mean)

This results, in one example, in:

         a  speed       roll
    ...
    35  35      8   9.444444
    36  36     14   9.666667
    37  37     11   9.888889
    38  38     16  10.250000
    39  39     16        NaN
    40  40     15        NaN
    41  41      6        NaN
    42  42      9  11.375000
    43  43      2  10.000000
    44  44      8   9.125000
    ...

What happens here is as follows:

  • We create a rolling window of size 10 ( df.speed.rolling(10) ).
  • For each window, which is a series of 10 numbers, we apply the function my_mean .
  • my_mean first counts the outliers, by summing the number of cases in which elements of the series s are smaller than the minimum or larger than the maximum.
  • If the outlier count is too large (more than 2 here), we say that the window has no mean and return not-a-number.
  • Otherwise, we filter out the outliers and calculate the mean of the remaining numbers ( s[(s <= MAX) & (s >= MIN)].mean() ).
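
To match the 10% rule from the question more closely, the threshold can be expressed as a fraction of the window rather than a fixed count. The following is only a sketch under a few assumptions: the names window_mean and MAX_OUTLIER_FRACTION and the sample data are made up for illustration, and a window with exactly 10% outliers (1 row out of 10) is treated as accepted, since the question mentions averaging "the rest of the 9 rows". Note that replacing an outlier with the mean of the other values and then averaging again gives the same number as simply averaging the other values, so no explicit replacement step is needed.

```python
import numpy as np
import pandas as pd

LOWER, UPPER = 0, 15          # outlier bounds from the question
MAX_OUTLIER_FRACTION = 0.10   # assumption: exactly 10% is still accepted


def window_mean(values: np.ndarray) -> float:
    """Mean of one window, or NaN if the window has too many outliers."""
    is_outlier = (values < LOWER) | (values > UPPER)
    if is_outlier.mean() > MAX_OUTLIER_FRACTION:
        return np.nan  # rejected window
    # Replacing each outlier with the mean of the remaining values and
    # averaging again equals the mean of the remaining values.
    return values[~is_outlier].mean()


# Hypothetical sample data: 20 and 25 are outliers
speed = pd.Series([5, 7, 9, 6, 8, 10, 4, 7, 6, 8, 20, 7, 25, 9], dtype=float)

# raw=True passes each window to window_mean as a NumPy array
roll = speed.rolling(10).apply(window_mean, raw=True)

print(roll.dropna())
print("Max rolling average:", roll.max())  # max() skips NaN by default
```

With this sample, the windows ending at index 10 and 11 contain one outlier each and are averaged over their 9 valid values, while the windows ending at index 12 and 13 contain two outliers and come back as NaN.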
