
Fill NaN values from another DataFrame (with different shape)

I'm looking for a faster approach to improve the performance of my solution to the following problem: a certain DataFrame has two columns with a few NaN values in them. The challenge is to replace these NaNs with values from a secondary DataFrame.

Below I'll share the data and code used to implement my approach. Let me explain the scenario: merged_df is the original DataFrame with a few columns, some of which have rows with NaN values:

[image: merged_df, showing NaN values in several columns, including day_of_week and holiday_flg]

As you can see from the image above, the columns day_of_week and holiday_flg are of particular interest. I would like to fill the NaN values of these columns by looking into a second DataFrame called date_info_df , which looks like this:

[image: date_info_df, with columns calendar_date, day_of_week and holiday_flg]

By using the values from the visit_date column in merged_df it is possible to search the second DataFrame on calendar_date and find matches. This way the values for day_of_week and holiday_flg can be taken from the second DataFrame.

The end result of this exercise is a DataFrame that looks like this:

[image: the final merged_df, with the NaN values in day_of_week and holiday_flg filled in]

You'll notice the approach I'm using relies on apply() to execute a custom function on every row of merged_df :

  • For every row, search for NaN values in day_of_week and holiday_flg ;
  • When a NaN is found in either or both of these columns, use the date from that row's visit_date to find a match in the second DataFrame, specifically in the date_info_df['calendar_date'] column;
  • After a successful match, copy the value from date_info_df['day_of_week'] into merged_df['day_of_week'] , and copy the value from date_info_df['holiday_flg'] into merged_df['holiday_flg'] .

Here is the working source code:

import math
import pandas as pd
import numpy as np
from IPython.display import display

### Data for df
data = { 'air_store_id':     [              'air_a1',     'air_a2',     'air_a3',     'air_a4' ], 
         'area_name':        [               'Tokyo',       np.nan,       np.nan,       np.nan ], 
         'genre_name':       [            'Japanese',       np.nan,       np.nan,       np.nan ], 
         'hpg_store_id':     [              'hpg_h1',       np.nan,       np.nan,       np.nan ],          
         'latitude':         [                  1234,       np.nan,       np.nan,       np.nan ], 
         'longitude':        [                  5678,       np.nan,       np.nan,       np.nan ],         
         'reserve_datetime': [ '2017-04-22 11:00:00',       np.nan,       np.nan,       np.nan ], 
         'reserve_visitors': [                    25,           35,           45,       np.nan ], 
         'visit_datetime':   [ '2017-05-23 12:00:00',       np.nan,       np.nan,       np.nan ], 
         'visit_date':       [ '2017-05-23'         , '2017-05-24', '2017-05-25', '2017-05-27' ],
         'day_of_week':      [             'Tuesday',  'Wednesday',       np.nan,       np.nan ],
         'holiday_flg':      [                     0,       np.nan,       np.nan,       np.nan ]
       }

merged_df = pd.DataFrame(data)
display(merged_df)

### Data for date_info_df
data = { 'calendar_date':     [ '2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26', '2017-05-27', '2017-05-28' ], 
         'day_of_week':       [    'Tuesday',  'Wednesday',   'Thursday',     'Friday',   'Saturday',     'Sunday' ], 
         'holiday_flg':       [            0,            0,            0,            0,            1,            1 ]         
       }

date_info_df = pd.DataFrame(data)
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date']) 
display(date_info_df)

# Fix the NaN values in day_of_week and holiday_flg by inspecting data from another dataframe (date_info_df)
def fix_weekday_and_holiday(row):
    weekday = row['day_of_week']   
    holiday = row['holiday_flg']

    # search dataframe date_info_df for the appropriate value when weekday is NaN
    if (type(weekday) == float and math.isnan(weekday)):
        search_date = row['visit_date']                               
        #print('  --> weekday search_date=', search_date, 'type=', type(search_date))        
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]                
        weekday = date_info_df.at[idx,'day_of_week']
        #print('  --> weekday search_date=', search_date, 'is', weekday)        
        row['day_of_week'] = weekday        

    # search dataframe date_info_df for the appropriate value when holiday is NaN
    if (type(holiday) == float and math.isnan(holiday)):
        search_date = row['visit_date']                               
        #print('  --> holiday search_date=', search_date, 'type=', type(search_date))        
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]                
        holiday = date_info_df.at[idx,'holiday_flg']
        #print('  --> holiday search_date=', search_date, 'is', holiday)        
        row['holiday_flg'] = int(holiday)

    return row


# send every row to fix_weekday_and_holiday
merged_df = merged_df.apply(fix_weekday_and_holiday, axis=1) 

# Convert data from float to int (to remove decimal places)
merged_df['holiday_flg'] = merged_df['holiday_flg'].astype(int)

display(merged_df)

I did a few measurements so you can understand the struggle:

  • On a DataFrame with 6 rows, apply() takes 3.01 ms ;
  • On a DataFrame with ~250000 rows, apply() takes 2min 51s ;
  • On a DataFrame with ~1215000 rows, apply() takes 4min 2s .

How do I improve the performance of this task?

You can use an index to speed up the lookup, and combine_first() to fill the NaN values:

cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
merged_df[cols] = merged_df[cols].combine_first(
    date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))

print(merged_df[cols])

The result:

 day_of_week  holiday_flg
0     Tuesday          0.0
1   Wednesday          0.0
2    Thursday          0.0
3    Saturday          1.0
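As a self-contained check, the snippet below reproduces this combine_first approach end-to-end on a minimal version of the question's data (only the three relevant columns):

```python
import numpy as np
import pandas as pd

merged_df = pd.DataFrame({
    "visit_date":  ["2017-05-23", "2017-05-24", "2017-05-25", "2017-05-27"],
    "day_of_week": ["Tuesday", "Wednesday", np.nan, np.nan],
    "holiday_flg": [0, np.nan, np.nan, np.nan],
})
date_info_df = pd.DataFrame({
    "calendar_date": pd.to_datetime(
        ["2017-05-23", "2017-05-24", "2017-05-25",
         "2017-05-26", "2017-05-27", "2017-05-28"]),
    "day_of_week":   ["Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday"],
    "holiday_flg":   [0, 0, 0, 0, 1, 1],
})

cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)

# look up each visit_date in the calendar-indexed table, re-align to
# merged_df's index, then let combine_first fill only the NaN slots
merged_df[cols] = merged_df[cols].combine_first(
    date_info_df.set_index("calendar_date").loc[visit_date, cols]
                .set_index(merged_df.index))
print(merged_df[cols])
```

Note that holiday_flg stays float here, since NaN forced the column to float dtype; a final .astype(int), as in the question's own code, restores integers.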

This is one solution. It should be efficient, as there is no explicit merge or apply .

merged_df['visit_date'] = pd.to_datetime(merged_df['visit_date']) 
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date']) 

s = date_info_df.set_index('calendar_date')['day_of_week']
t = date_info_df.set_index('day_of_week')['holiday_flg']

merged_df['day_of_week'] = merged_df['day_of_week'].fillna(merged_df['visit_date'].map(s))
merged_df['holiday_flg'] = merged_df['holiday_flg'].fillna(merged_df['day_of_week'].map(t))

Result

  air_store_id area_name day_of_week genre_name  holiday_flg hpg_store_id  \
0       air_a1     Tokyo     Tuesday   Japanese          0.0       hpg_h1   
1       air_a2       NaN   Wednesday        NaN          0.0          NaN   
2       air_a3       NaN    Thursday        NaN          0.0          NaN   
3       air_a4       NaN    Saturday        NaN          1.0          NaN   

   latitude  longitude     reserve_datetime  reserve_visitors visit_date  \
0    1234.0     5678.0  2017-04-22 11:00:00              25.0 2017-05-23   
1       NaN        NaN                  NaN              35.0 2017-05-24   
2       NaN        NaN                  NaN              45.0 2017-05-25   
3       NaN        NaN                  NaN               NaN 2017-05-27   

        visit_datetime  
0  2017-05-23 12:00:00  
1                  NaN  
2                  NaN  
3                  NaN  

Explanation

  • s is a pd.Series mapping calendar_date to day_of_week from date_info_df ; similarly, t maps day_of_week to holiday_flg .
  • Use pd.Series.map , which takes a pd.Series as input, to update missing values where possible. Note that day_of_week is filled first, so that t can be applied to the completed column.
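The core fillna/map pattern in isolation, on toy data (the values below are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "visit_date":  ["2017-05-23", "2017-05-25", "2017-05-27"],
    "day_of_week": ["Tuesday", np.nan, np.nan],
})
# lookup Series: index is the key (date), values are the replacement
s = pd.Series(["Tuesday", "Thursday", "Saturday"],
              index=["2017-05-23", "2017-05-25", "2017-05-27"])

# map() translates each visit_date through s; fillna() keeps existing values
df["day_of_week"] = df["day_of_week"].fillna(df["visit_date"].map(s))
print(df["day_of_week"].tolist())  # ['Tuesday', 'Thursday', 'Saturday']
```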

Edit: one can also use merge to solve the problem, about 10 times faster than the old approach. (Make sure "visit_date" and "calendar_date" are of the same format.)

# no need to set_index on date_info_df; just select the columns needed.
merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]], 
                left_on="visit_date", 
                right_on="calendar_date", 
                how="left") # outer should also work

The desired result will now be in the "day_of_week_y" and "holiday_flg_y" columns. In this approach, as in the map approach, we don't use the old "day_of_week" and "holiday_flg" values at all; we just need to map the results from date_info_df onto merged_df .

merge can also do the job because date_info_df 's entries are unique, so no duplicate rows will be created.
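That uniqueness caveat matters because merge multiplies rows whenever a key repeats on the right-hand side; a minimal sketch with toy frames:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"]})
right_unique = pd.DataFrame({"key": ["a", "b"], "val": [1, 2]})
right_dup    = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 9, 2]})

# unique right-hand keys: one output row per left row
print(len(left.merge(right_unique, on="key", how="left")))  # 2
# duplicated right-hand key "a": the left row is repeated
print(len(left.merge(right_dup, on="key", how="left")))     # 3
```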


You can also try using pandas.Series.map . What it does is:

Map values of Series using input correspondence (which can be a dict, Series, or function)

# set "calendar_date" as the index such that 
# mapping["day_of_week"] and mapping["holiday_flg"] will be two series
# with date_info_df["calendar_date"] as their index.
mapping = date_info_df.set_index("calendar_date")

# this line is optional (depending on the layout of the data)
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)

# do replacement here.
merged_df["day_of_week"] = merged_df.visit_date.map(mapping["day_of_week"])
merged_df["holiday_flg"] = merged_df.visit_date.map(mapping["holiday_flg"])

Note that merged_df.visit_date was originally of string type. Thus, we use

merged_df.visit_date = pd.to_datetime(merged_df.visit_date)

to make it a datetime.

Timings: the date_info_df and merged_df datasets were provided by karlphillip.

date_info_df = pd.read_csv("full_date_info_data.csv")
merged_df = pd.read_csv("full_data.csv")   
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
date_info_df.calendar_date = pd.to_datetime(date_info_df.calendar_date)
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)

# the merge method I propose above
%timeit merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]], left_on="visit_date", right_on="calendar_date", how="left")
511 ms ± 34.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# HYRY's method without assigning it back
%timeit merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
772 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method with assigning it back
%timeit merged_df[cols] = merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))    
258 ms ± 69.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

One can see that HYRY's method runs 3 times faster when the result is assigned back to merged_df ; this is why I thought HYRY's method was faster than mine at first glance. I suspect this is due to the nature of combine_first : its speed likely depends on how sparse merged_df is. When the results are assigned back, the columns become full, so rerunning it is faster.

The performances of the merge and combine_first methods are nearly equivalent. There may be circumstances where one is faster than the other; each user should run their own tests on their datasets.

Another thing to note is that the merge method assumes every date in merged_df is contained in date_info_df . If some dates appear in merged_df but not in date_info_df , the merge returns NaN for them, and those NaNs can override parts of merged_df that originally contained values! This is when the combine_first method should be preferred. See the discussion by MaxU in "Pandas replace, multi column criteria".
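A small sketch of that failure mode on toy data: when a date is missing from the lookup, a plain map-and-overwrite produces NaN, while combine_first keeps the value that was already there:

```python
import pandas as pd

merged = pd.DataFrame({"visit_date":  ["2017-05-23", "2017-05-30"],
                       "day_of_week": ["Tuesday", "Wednesday"]})  # already filled
lookup = pd.Series({"2017-05-23": "Tuesday"})  # 2017-05-30 is missing

# plain overwrite: the unmatched date becomes NaN and the old value is lost
overwrite = merged["visit_date"].map(lookup)
print(overwrite.tolist())  # ['Tuesday', nan]

# combine_first: existing values win, NaN slots are filled from the lookup
kept = merged["day_of_week"].combine_first(overwrite)
print(kept.tolist())       # ['Tuesday', 'Wednesday']
```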
