Fill missing data with Python

I am fairly new to Python and have the following problem. I have a DataFrame containing data from multiple sensors. The dataset has missing (NA) values that need to be filled according to the rules below.

  1. If the next sensor has data at the same timestamp, fill the gap with the next sensor's value.
  2. If the neighbouring sensor has no data either, fill the gap with the average of all sensors that do have data at the same timestamp.
  3. If all sensors are missing data at the same timestamp, fill the gap by linear interpolation of the sensor's own series.

Here is some sample data I built.

import numpy as np
import pandas as pd
sensor1 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10),"sensor":[1,1,1,1,1,1,1,1,1,1],"value":[np.nan,2,2,2,2,np.nan,np.nan,np.nan,4,6]})
sensor2 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10),"sensor":[2,2,2,2,2,2,2,2,2,2],"value":[3,4,5,6,7,np.nan,np.nan,np.nan,7,8]})
sensor3 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10),"sensor":[3,3,3,3,3,3,3,3,3,3],"value":[2,3,4,5,6,7,np.nan,np.nan,7,8]})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
sensordata = pd.concat([sensor1, sensor2, sensor3]).reset_index(drop=True)

Any help would be appreciated.

Based on the answer from Christian, the solution is as follows.

import numpy as np
import pandas as pd

# create data
df1 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10),"sensor":[1,1,1,1,1,1,1,1,1,1],"value":[np.nan,2,2,2,2,np.nan,np.nan,np.nan,4,6]})
df2 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10),"sensor":[2,2,2,2,2,2,2,2,2,2],"value":[3,4,5,6,7,np.nan,np.nan,np.nan,7,8]})
df3 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10),"sensor":[3,3,3,3,3,3,3,3,3,3],"value":[2,3,4,5,6,7,np.nan,np.nan,7,8]})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df1, df2, df3]).reset_index(drop=True)

# pivot dataframe
df = df.pivot(index = 'date', columns ='sensor',values ='value')

# step 1, fill missing values from a chosen reference sensor first (here sensor 3)
selectedsensor = 3
for c in df.columns:
    df[c] = df[c].fillna(df[selectedsensor])

# step 2, use average of all available sensors to fill
df = df.transpose().fillna(df.transpose().mean()).transpose()

# step 3, use interpolate to fill remaining missing values
df = df.interpolate()

# melt back to the original long data format
df = df.reset_index()
df = df.melt(id_vars=['date'],var_name = 'sensor')
#df = df.unstack('sensor').reset_index()
#df = df.rename(columns ={0:'value'})

The final output is as follows:

         date sensor  value
0  2000-01-01      1    2.0
1  2000-01-02      1    2.0
2  2000-01-03      1    2.0
3  2000-01-04      1    2.0
4  2000-01-05      1    2.0
5  2000-01-06      1    7.0
6  2000-01-07      1    6.0
7  2000-01-08      1    5.0
8  2000-01-09      1    4.0
9  2000-01-10      1    6.0
10 2000-01-01      2    3.0
11 2000-01-02      2    4.0
12 2000-01-03      2    5.0
13 2000-01-04      2    6.0
14 2000-01-05      2    7.0
15 2000-01-06      2    7.0
16 2000-01-07      2    7.0
17 2000-01-08      2    7.0
18 2000-01-09      2    7.0
19 2000-01-10      2    8.0
20 2000-01-01      3    2.0
21 2000-01-02      3    3.0
22 2000-01-03      3    4.0
23 2000-01-04      3    5.0
24 2000-01-05      3    6.0
25 2000-01-06      3    7.0
26 2000-01-07      3    7.0
27 2000-01-08      3    7.0
28 2000-01-09      3    7.0
29 2000-01-10      3    8.0
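
For reuse, the same three steps can be wrapped in a function. The following is only a sketch: the name fill_sensor_gaps and its reference_sensor parameter are made up for illustration, and it assumes long-format input with date, sensor and value columns, as in the sample data.

```python
import numpy as np
import pandas as pd


def fill_sensor_gaps(long_df, reference_sensor):
    """Hypothetical helper: apply the three fill steps to long-format data."""
    wide = long_df.pivot(index="date", columns="sensor", values="value")
    # Step 1: take missing values from the reference sensor's column.
    wide = wide.apply(lambda col: col.fillna(wide[reference_sensor]))
    # Step 2: fill what is left with the mean of the sensors available
    # at the same timestamp (rows that are entirely NaN stay NaN).
    wide = wide.T.fillna(wide.mean(axis=1)).T
    # Step 3: linear interpolation along each sensor's own series.
    wide = wide.interpolate()
    # Back to the long layout: one row per (date, sensor).
    return wide.reset_index().melt(id_vars=["date"], var_name="sensor")


# Sample data from the question, built with pd.concat.
dates = pd.date_range("1/1/2000", periods=10)
values = {
    1: [np.nan, 2, 2, 2, 2, np.nan, np.nan, np.nan, 4, 6],
    2: [3, 4, 5, 6, 7, np.nan, np.nan, np.nan, 7, 8],
    3: [2, 3, 4, 5, 6, 7, np.nan, np.nan, 7, 8],
}
sensordata = pd.concat(
    [pd.DataFrame({"date": dates, "sensor": s, "value": v}) for s, v in values.items()]
).reset_index(drop=True)

filled = fill_sensor_gaps(sensordata, reference_sensor=3)
print(filled)
```

Called with reference_sensor=3, this should reproduce the table above. Note that step 1 here fills from one fixed sensor rather than from "the next sensor", which is a simplification of the first rule.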

You can do the following:

Your dataset, pivoted:

df = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10),"sensor1":[np.nan,2,2,2,2,np.nan,np.nan,np.nan,4,6], "sensor2":[3,4,5,6,7,np.nan,np.nan,np.nan,7,8], "sensor3":[2,3,4,5,6,7,np.nan,np.nan,7,8]}).set_index('date')

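1) This is a backward fill with limit=1 along axis 1, so each gap is filled from the next sensor's column:

df.bfill(axis=1, limit=1)  # equivalent to fillna(method='bfill', limit=1, axis=1), which recent pandas versions deprecate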
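
To see what a backward fill along axis 1 with a limit of 1 does, here is a small sketch on a made-up three-row frame (the demo data is an assumption, not part of the question). It uses DataFrame.bfill, which expresses the same operation without the method argument.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset: one row per timestamp, one column per sensor.
demo = pd.DataFrame(
    {
        "sensor1": [np.nan, np.nan, np.nan],
        "sensor2": [3.0, np.nan, np.nan],
        "sensor3": [2.0, 7.0, np.nan],
    }
)

# Fill each gap from the column to its right, at most one cell per gap.
step1 = demo.bfill(axis=1, limit=1)
print(step1)
# Row 0: sensor1 takes 3.0 from sensor2.
# Row 1: sensor2 takes 7.0 from sensor3; sensor1 stays NaN because of limit=1.
# Row 2: nothing to copy from, so the whole row stays NaN.
```

The last column has no "next" sensor, so its gaps are never filled by this step and have to be handled by the later ones.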

2) This is fillna with the mean along axis 1. fillna does not appear to support this directly, but transposing works around it:

df.transpose().fillna(df.transpose().mean()).transpose()
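
If the double transpose is hard to read, a row-wise apply is one possible alternative. This is a sketch on made-up data, not a claim about which variant is faster; on large frames the transpose version is likely cheaper than calling a Python function per row.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset with partially missing rows.
demo = pd.DataFrame(
    {
        "sensor1": [np.nan, 2.0, np.nan],
        "sensor2": [4.0, np.nan, np.nan],
        "sensor3": [6.0, 8.0, np.nan],
    }
)

# For each timestamp (row), replace NaN with the mean of the sensors
# that do have a value. An all-NaN row has a NaN mean and is left as is.
row_filled = demo.apply(lambda row: row.fillna(row.mean()), axis=1)

# The transpose trick from the answer, for comparison.
transposed = demo.transpose().fillna(demo.transpose().mean()).transpose()

print(row_filled)
# Row 0: sensor1 becomes 5.0 (mean of 4.0 and 6.0).
# Row 1: sensor2 becomes 5.0 (mean of 2.0 and 8.0).
# Row 2: stays NaN everywhere.
```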

3) This is just interpolate, which is linear by default:

df.interpolate()
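
A short sketch of how the default interpolation treats gaps at different positions, using a made-up series. The limit_area="inside" variant is shown only as an option in case trailing gaps should not be filled.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, an inner gap and a trailing gap.
s = pd.Series([np.nan, 7.0, np.nan, np.nan, 4.0, np.nan])

# Default: linear, treating the rows as equally spaced.
default = s.interpolate()
# Inner gap becomes 6.0 and 5.0, the trailing gap repeats the last
# valid value (4.0), and the leading NaN is left untouched.

# Only fill gaps that are surrounded by valid values.
inside_only = s.interpolate(limit_area="inside")

print(default.tolist())
print(inside_only.tolist())
```

Because the default assumes equal spacing, irregular timestamps would need a different method, such as method="time" on a frame indexed by date.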

Bonus:

This got a bit uglier, since I had to apply it column by column, but here is a version that uses sensor 3 to fill:

for c in df.columns:
   df[c] = df[c].fillna(df["sensor3"])
df
