
Create 'incremental' log (pandas df) from time based data based on column's previous and next value

This is an attempt to resolve a data quality issue in time-based sensor data while creating a new 'log' from it. The issue is as follows:

Ideally, Parameter X increases with time at an acceptable rate or stays constant (it can never decrease). In the actual data this is not the case, due to calibration issues. There are two cases, both physically impossible:

  • case 1) X may suddenly increase from 50,000 to 300,000, then stay at that value for a short duration. When the error is spotted, the data is reset back to approximately 50,000.
  • case 2) X may suddenly decrease from, say, 80,000 to 50,000, then stay at that value for a short duration. When the error is spotted, the data is reset back to approximately 80,000.

In the first iteration, a log was created wherever dX > 0.01 (the difference between the actual and shifted column values). This pulled both cases of bad data into the new log. In my attempt to clear out this bad data, I wrote the program below, based on the logs from the first iteration.

The program below resolves case 2, but it makes case 1 much worse. If there is a sudden increase to, say, 300,000, the 'log' will not update beyond the point at which it first reached 300,000, so useful data is lost from where the value was reset back to 50,000.

import pandas as pd

data = {'time':[43254.09605,43254.09606,43254.09609,43254.09613,43254.09616,43254.09618,43254.09719,43254.09721,43254.09723,43254.09725]
,'X': [50000,50000.2,50000.4,300000.2,300000.4,300000.6,50000.1,50000.2,50000.4,50000.6]
,'dX':[0.19995117,0.19995117,0.19995117,32002.398,0.19921875,0.203125,0.100097656,0.099853516,0.19995117,0.20019531]
,'dX2':[None,0.2,0.2,249999.8,0.2,0.2,-250000.5,0.1,0.2,0.2]}

df = pd.DataFrame.from_dict(data)

def log_maker(df):
    prev = 0
    list_df = []

    for i in range(1, len(df)):
        curr = df.loc[i, 'X']

        if (df.loc[i, 'dX'] > 0.01) and (df.loc[i, 'dX'] < 10) and (curr > prev):
            list_df.append(df.loc[i])
            prev = curr  # updates prev even if curr is a bad value, in our case 300000.4

    return list_df
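Running this function on the sample data above reproduces the failure: the two spike rows pass the check (their dX is small once X has settled at the spike), and every row after the reset is dropped because prev is stuck at the spike value. This snippet is self-contained and simply re-runs the code above:

```python
import pandas as pd

data = {'X': [50000, 50000.2, 50000.4, 300000.2, 300000.4,
              300000.6, 50000.1, 50000.2, 50000.4, 50000.6],
        'dX': [0.19995117, 0.19995117, 0.19995117, 32002.398, 0.19921875,
               0.203125, 0.100097656, 0.099853516, 0.19995117, 0.20019531]}
df = pd.DataFrame(data)

def log_maker(df):
    prev = 0
    list_df = []
    for i in range(1, len(df)):
        curr = df.loc[i, 'X']
        if (df.loc[i, 'dX'] > 0.01) and (df.loc[i, 'dX'] < 10) and (curr > prev):
            list_df.append(df.loc[i])
            prev = curr  # bad values update prev too
    return list_df

log = pd.DataFrame(log_maker(df))
print(log['X'].tolist())  # → [50000.2, 50000.4, 300000.4, 300000.6]
```

The spike rows 300000.4 and 300000.6 are kept, and the good rows after the reset (50000.1 onwards) are lost, exactly as described.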

Here dX2 is the shift in X within the new log, while dX comes from the original time log.

I was wondering if there is a way to store the last good row and compare curr only against that last good prev.

I am not a programmer, but I am an expert in this specific sensor data, so if there are any questions about it I can answer them. (df.loc[i, 'dX'] < 10) is a reasonable 'rate' at which X can increase. Also, I can't set a hard criterion such as "X cannot be greater than 299,999", because there may be a sudden jump from 50,000 to 55,000, which is also incorrect even though 55,000 itself is a correct value that will appear later in the log as time increases.

I ended up using dX_real = curr - prev instead of dX, which resolved the issue for most cases. This lets prev be updated only by correct values. I also set an initial condition to get the first good point, in case the time log starts at a wrong (very high) X value (that code is not shown here).

def log_maker(df):
    prev = 0
    list_df = []

    for i in range(1, len(df)):
        curr = df.loc[i, 'X']
        dX_real = curr - prev  # rate relative to the last good value, not the previous row

        if (dX_real > 0.01) and (dX_real < 10):
            list_df.append(df.loc[i])
            prev = curr  # prev is updated only when curr passes the rate check

    return list_df
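The initial condition mentioned above is not shown. A minimal sketch of one possible version, assuming the first row of the log is a trustworthy reading, is to seed prev (and the output) from that row:

```python
import pandas as pd

data = {'X': [50000, 50000.2, 50000.4, 300000.2, 300000.4,
              300000.6, 50000.1, 50000.2, 50000.4, 50000.6]}
df = pd.DataFrame(data)

def log_maker(df):
    prev = df.loc[0, 'X']   # assumption: the first reading is good
    list_df = [df.loc[0]]   # keep it in the log as the starting point
    for i in range(1, len(df)):
        curr = df.loc[i, 'X']
        dX_real = curr - prev  # compared against the last *good* value
        if 0.01 < dX_real < 10:
            list_df.append(df.loc[i])
            prev = curr  # only values that passed the check become the baseline
    return list_df

log = pd.DataFrame(log_maker(df))
print(log['X'].tolist())  # → [50000.0, 50000.2, 50000.4, 50000.6]
```

On the sample data the whole 300,000 spike is rejected, and prev stays at 50000.4. Note that the first rows after the reset (50000.1 to 50000.4) are also skipped, because the reset lands at or slightly below the last good value; that is consistent with the rule that X can never decrease, and logging resumes at 50000.6.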
