This is an attempt to resolve a data-quality issue in time-based sensor data while creating a new 'log' from it. The issue is this:
Ideally, parameter X increases with time at an acceptable rate or stays constant (it can never decrease). In the actual data this is not the case, due to calibration issues. There are two cases of bad data, both physically impossible -
In the first iteration, a log was created wherever dX > 0.01 (the difference between the actual and shifted column values), which pulled both cases of bad data into the new log. In my attempt to clear out this bad data, I wrote the program below based on the logs from the first iteration.
The program below resolves case 2, but it leads to a much worse outcome in case 1. If there is a sudden increase to, say, 300,000, the 'log' will not update beyond the point at which it first reached 300,000. Useful data is therefore lost from the point where X was reset back to 50,000.
import pandas as pd

data = {'time': [43254.09605, 43254.09606, 43254.09609, 43254.09613, 43254.09616,
                 43254.09618, 43254.09719, 43254.09721, 43254.09723, 43254.09725],
        'X': [50000, 50000.2, 50000.4, 300000.2, 300000.4, 300000.6,
              50000.1, 50000.2, 50000.4, 50000.6],
        'dX': [0.19995117, 0.19995117, 0.19995117, 32002.398, 0.19921875,
               0.203125, 0.100097656, 0.099853516, 0.19995117, 0.20019531],
        'dX2': [None, 0.2, 0.2, 249999.8, 0.2, 0.2, -250000.5, 0.1, 0.2, 0.2]}
df = pd.DataFrame.from_dict(data)
def log_maker(df):
    prev = 0
    list_df = []
    for i in range(1, len(df)):
        curr = df.loc[i, 'X']
        if (df.loc[i, 'dX'] > 0.01) and (df.loc[i, 'dX'] < 10) and (curr > prev):
            list_df.append(df.loc[i])
            prev = curr  # updates prev even if curr is a bad value, in our case 300000.4
    return list_df
Here dX2 is the shift of X within the new log, while dX comes from the original time log.
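To see the case 1 failure concretely, here is a minimal, self-contained reproduction using only the sample frame and function from above (dX2 omitted since the function does not use it). The bad rows at 300000.4 and 300000.6 pass both checks, so prev jumps to 300000.6 and every good row after the reset is rejected:

```python
import pandas as pd

data = {'time': [43254.09605, 43254.09606, 43254.09609, 43254.09613, 43254.09616,
                 43254.09618, 43254.09719, 43254.09721, 43254.09723, 43254.09725],
        'X': [50000, 50000.2, 50000.4, 300000.2, 300000.4, 300000.6,
              50000.1, 50000.2, 50000.4, 50000.6],
        'dX': [0.19995117, 0.19995117, 0.19995117, 32002.398, 0.19921875,
               0.203125, 0.100097656, 0.099853516, 0.19995117, 0.20019531]}
df = pd.DataFrame(data)

def log_maker(df):
    prev = 0
    list_df = []
    for i in range(1, len(df)):
        curr = df.loc[i, 'X']
        if (df.loc[i, 'dX'] > 0.01) and (df.loc[i, 'dX'] < 10) and (curr > prev):
            list_df.append(df.loc[i])
            prev = curr  # 300000.4 has a small dX and exceeds prev, so it slips through
    return list_df

kept = [row['X'] for row in log_maker(df)]
print(kept)
```

The printed list is [50000.2, 50000.4, 300000.4, 300000.6]: the two bad 300000-range rows are kept, and none of the good 50000-range rows after the reset survive.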
I was thinking there might be a way to store the last good row and compare curr only with the last good prev.
I am not a programmer, but I am an expert in this specific sensor data, so if there are any questions about that I can answer them. The check (df.loc[i, 'dX'] < 10) is a reasonable 'rate' at which X can increase. Also, I can't set a hard criterion such as "X cannot be greater than 299999", because there may be a sudden incorrect increase from 50000 to 55000, while 55000 itself is a correct value that will appear later in the log as time increases.
I ended up using dX_real = curr - prev instead of dX, which resolved the issue in most cases. This lets prev be updated only by correct values. I also set an initial condition to get the first good point, in case the time log starts at a wrong (very high) X value (that code is not shown here).
def log_maker(df):
    prev = 0
    list_df = []
    for i in range(1, len(df)):
        curr = df.loc[i, 'X']
        dX_real = curr - prev
        if (dX_real > 0.01) and (dX_real < 10):
            list_df.append(df.loc[i])
            prev = curr  # prev now advances only when curr passes both checks
    return list_df
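The initial condition mentioned above is not in the post, so here is one possible sketch of it, purely my assumption, not the author's actual code: the helper name first_good_index, the window size, and the idea of looking for a short run of plausible increments are all hypothetical. It seeds prev at the first row that begins a run of consecutive increments inside the plausible band:

```python
import pandas as pd

def first_good_index(df, window=3):
    # Hypothetical helper: return the first index i whose next `window`
    # consecutive increments of X all fall in the plausible band (0.01, 10).
    x = df['X'].to_numpy()
    for i in range(len(x) - window):
        diffs = x[i + 1:i + 1 + window] - x[i:i + window]
        if all(0.01 < d < 10 for d in diffs):
            return i
    return 0  # fall back to the first row if no clean run is found

# Example: a log that starts at a wrong, very high X value.
df = pd.DataFrame({'X': [999999.0, 100.0, 100.2, 100.4, 100.6]})
start = first_good_index(df)
prev = df.loc[start, 'X']  # seed prev with the first trustworthy value
print(start, prev)
```

In this example the spurious first row is skipped (start is 1) and prev is seeded with 100.0, so the loop in log_maker would begin from a good baseline instead of comparing everything against the bad initial value.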