
Pandas: How to fill missing values in a large dataset?

I have a large dataset (around 8 million rows x 25 columns) in Pandas and I am struggling to perform one operation in a performant manner.

Here is what my dataset looks like:

                   temp   size
location_id hours             
135         78     12.0  100.0
            79      NaN    NaN
            80      NaN    NaN
            81     15.0  112.0
            82      NaN    NaN
            83      NaN    NaN
            84     14.0   22.0
  • I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
  • The rest of the data is numeric (float). I have only included 2 columns here, normally there are around 20 columns.
  • What I want to do is fill those NaN values using the values around them. Basically, the value for hour 79 will be derived from the values of hours 78 and 81. In this example, the temp value for hour 79 will be 13.0 (basic linear interpolation).
  • I know that only hours 78, 81, 84, ... (multiples of 3) will have values, and the rest will be NaN. That will always be the case, for hours between 78 and 120.
  • With these in mind, I have implemented the following algorithm in Pandas:
# restrict to the relevant hour range (78-120)
hours = df.index.get_level_values(1)
df_relevant_data = df.loc[(hours >= 78) & (hours <= 120), :]

for location_id, data_of_location_id in df_relevant_data.groupby("location_id"):

    for hour in range(81, 123, 3):

        top_hour_data = data_of_location_id.loc[(location_id, hour), ['temp', 'size']] # e.g. 81
        bottom_hour_data = data_of_location_id.loc[(location_id, (hour - 3)), ['temp', 'size']] # e.g. 78

        difference = top_hour_data.values - bottom_hour_data.values
        bottom_bump = difference * (1/3) # amount to add to calculate the 79th hour
        top_bump = difference * (2/3) # amount to add to calculate the 80th hour

        df.loc[(location_id, (hour - 2)), ['temp', 'size']] = bottom_hour_data.values + bottom_bump
        df.loc[(location_id, (hour - 1)), ['temp', 'size']] = bottom_hour_data.values + top_bump

  • This works correctly, but the performance is horrible: it takes at least 10 minutes on my dataset, which is not acceptable.
  • Is there a better/faster way to implement this? I am actually working on only a slice of the whole data (only hours between 78 and 120), so I would expect it to run much faster.

I believe you are looking for interpolate:

print(df.interpolate())

                        temp   size
location_id hours
135         78     12.000000  100.0
            79     13.000000  104.0
            80     14.000000  108.0
            81     15.000000  112.0
            82     14.666667   82.0
            83     14.333333   52.0
            84     14.000000   22.0
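One caveat: by default interpolate works straight down the frame, so with a MultiIndex the last known value of one location_id can bleed into the first missing hours of the next. If that matters for your data, interpolate within each group instead. A minimal sketch on small synthetic data (the second location, 136, and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Build a small frame mimicking the question's layout: two locations,
# values known only at hours 78, 81, 84 (multiples of 3).
idx = pd.MultiIndex.from_product(
    [[135, 136], range(78, 85)], names=["location_id", "hours"]
)
df = pd.DataFrame(np.nan, index=idx, columns=["temp", "size"])
df.loc[(135, 78)] = [12.0, 100.0]
df.loc[(135, 81)] = [15.0, 112.0]
df.loc[(135, 84)] = [14.0, 22.0]
df.loc[(136, 78)] = [20.0, 50.0]
df.loc[(136, 81)] = [23.0, 56.0]
df.loc[(136, 84)] = [21.0, 44.0]

# Linear interpolation within each location, so values never
# leak from one location_id into the next.
filled = df.groupby(level="location_id", group_keys=False).apply(
    lambda g: g.interpolate()
)
print(filled.loc[135])
```

Since this is vectorized inside each group, it should be far faster than the nested Python loops over 60k locations.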
