
Pandas: find max not counting outliers

I have a dataframe where each column represents a geographic point, and each row represents a minute in a day. The value of each cell is the flow of water at that point in CFS. Below is a graph of one of these time-flow series.

Basically, I need to calculate the maximum absolute flow at each of these locations during the day, which in this case would be that hump of 187 cfs. However, there are instabilities, so DF.abs().max() returns 1197 cfs. I need to somehow exclude the outliers from the calculation. As you can see, there is no pattern to the outliers, but if you look at the graph, no two consecutive points in time should differ by more than x% in flow. I should mention that there are 15K of these points, so the fastest solution is the best.

Does anyone know how I can accomplish this in Python, or at least know the statistical term for what I want to do? Thanks!
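One possible reading of the "no more than x% change between consecutive points" criterion is sketched below. The readings and the 50% threshold (`max_rel_change`) are made up for illustration. A sample is flagged only when it jumps away from both of its neighbours, so the good sample that follows a spike is kept.

```python
import pandas as pd

# Hypothetical readings (cfs) for one location, with one unstable sample
flow = pd.Series([180.0, 183.0, 1197.0, 187.0, 185.0, 182.0])

# Assumed threshold: at most a 50% change between neighbouring samples
max_rel_change = 0.5

a = flow.abs()

# Relative jump with respect to the previous and the next sample.
# Comparisons against the NaN produced by shift() evaluate to False,
# so the first and last samples are never flagged.
jump_from_prev = (a - a.shift(1)).abs() > max_rel_change * a.shift(1)
jump_to_next = (a - a.shift(-1)).abs() > max_rel_change * a.shift(-1)

# Treat a sample as an outlier only when it jumps away from BOTH neighbours
is_spike = jump_from_prev & jump_to_next

# mask() replaces flagged samples with NaN, which max() skips
print(a.mask(is_spike).max())  # 187.0
```

The same expressions work column-wise on a whole DataFrame, since `shift`, `mask` and `max` all operate per column. Two caveats: a relative threshold becomes unreliable when the flow is close to zero, and a run of several consecutive bad samples would not be flagged by this test.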

[Image: graph of one of the time-flow series]

In my opinion, the statistical term you are looking for is smoothing or denoising data.

Here is my try:

# Importing packages
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter

# Creating a curve with a local maximum to simulate "ideal data"
x = np.arange(start=-1, stop=1, step=0.001)
y_ideal = 10**-(x**2)

# Adding some randomly distributed outliers to simulate "real data"
y_real = y_ideal.copy()
np.random.seed(0)
for i in range(50):
    x_index = np.random.choice(len(x))
    y_real[x_index] = np.random.randint(-3, 5)

# Denoising with Savitzky-Golay (window size = 501, polynomial order = 3)
y_denoised = savgol_filter(y_real, window_length=501, polyorder=3)
# You should optimize these values to fit your needs

# Getting the index of the maximum value from the "denoised data"
max_index = np.where(y_denoised == np.amax(y_denoised))[0]

# Recovering the maximum value and reporting
max_value = y_real[max_index][0]
print(f'The maximum value is around {max_value:.5f}')


Please keep in mind that:

  1. This solution is approximate.

  2. You should tune the window_length and polyorder arguments passed to savgol_filter() to fit your data.

  3. If the region where your maximum is located is noisy, you can use max_value = y_denoised[max_index][0] instead of max_value = y_real[max_index][0].
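For a whole DataFrame with many location columns, a different smoother may be worth trying. The sketch below swaps Savitzky-Golay for a centered rolling median, which ignores isolated spikes instead of averaging them in. This is not the method used above: the data, column names, spike positions and the 5-minute window are all hypothetical, so tune the window to how long your instabilities last.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data:
# one row per minute of the day, one column per location
minutes = np.arange(1440)
df = pd.DataFrame({
    "point_a": 187.0 * np.exp(-((minutes - 700) / 200.0) ** 2),
    "point_b": -120.0 * np.exp(-((minutes - 300) / 150.0) ** 2),
})

# Inject spikes (some adjacent, some right on the peak) to mimic instabilities
spike_rows = [50, 51, 400, 699, 700, 1000, 1300]
df.loc[spike_rows, "point_a"] = 1197.0
df.loc[spike_rows, "point_b"] = -900.0

# Centered 5-minute rolling median, computed for every column at once.
# A run of up to two bad samples can never be the median of five.
smoothed = df.abs().rolling(window=5, center=True, min_periods=1).median()

# One robust maximum per location
robust_max = smoothed.max()

print(df.abs().max())  # dominated by the spikes: 1197 and 900
print(robust_max)      # close to the true humps: about 187 and 120
```

Because the median returns one of the observed values, the result stays on the scale of the real readings. The trade-off is that a genuine peak narrower than about half the window would also be suppressed.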

Note: This solution is based on this other Stack Overflow answer


 