
Imputing NAs with millions of rows of data in R

I have an orders dataset that contains sales order and sales order line information. Below is a screen shot of the first few columns of data:

[Screenshot: first few columns of the data]

Each sales order is unique but can have multiple sales order lines. About 20% of the data is what we call remakes, which are identified by the sales order number not matching the Original Order number column. We are trying to build a model to predict whether an order will be returned or not. Unfortunately, there are three columns (width, height, and fabric number, not shown) that have NAs for the sales orders that were remakes. I'm trying to impute those NAs with the values from the original order.

This is the code I have:

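# For each row with a missing WIDTH, copy the WIDTH of the first row
# whose SALES_ORDER equals this row's ORIGINAL_ORDER
for (i in seq_along(hd$SALES_ORDER)) {
  if (is.na(hd$WIDTH[i])) {
    hd$WIDTH[i] <- hd$WIDTH[hd$ORIGINAL_ORDER[i] == hd$SALES_ORDER][1]
  }
}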

The [1] takes the first value returned, since there could be multiple sales lines per order. I also tried to match on sales order line and original order line, but kept getting a 'value length' error.

My data has 3 million+ rows and 400k NAs. The for loop runs, but it has been going for an hour. Is there a more efficient way to accomplish this?

Thanks

This seems unusually slow. Even without further optimization (e.g., using data.table), the approach below takes only a couple of seconds on a 2-million-row data frame to fill in the NAs in 1 million of those rows from the preceding order with the same ORIGINAL_ORDER.

library(dplyr); library(tidyr)
my_data_million <- data.frame(stringsAsFactors = FALSE, # already the default in R >= 4.0.0
                      ORIGINAL_ORDER = rep(1:1000000, 2),
                      SALES_ORDER = 1000000:2999999,
                      WIDTH = c(sample(1:50, 1000000, replace = TRUE), rep(NA, 1000000))
) %>%
  slice_sample(n = 2E6, replace = FALSE)   # Shuffling just to show it's still fast


my_data_million %>%
  arrange(ORIGINAL_ORDER, SALES_ORDER) %>%
  group_by(ORIGINAL_ORDER) %>%
  tidyr::fill(WIDTH, .direction = "updown") %>%    # fill NAs within each group: down first, then up
  ungroup()
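
If you would rather stay in base R and keep the "first matching row" logic of the original loop, a vectorized lookup with match() may be worth trying. The sketch below is only an illustration on toy data: it assumes the data frame is called hd as in the question, that original orders have SALES_ORDER equal to ORIGINAL_ORDER, and that the second missing column is named HEIGHT (that name is a guess, since the column is not shown in the question).

```r
# Toy data: orders 101 and 102 are originals (SALES_ORDER == ORIGINAL_ORDER);
# orders 201 and 202 are remakes whose WIDTH and HEIGHT are missing.
hd <- data.frame(
  SALES_ORDER    = c(101, 101, 102, 201, 202),
  ORIGINAL_ORDER = c(101, 101, 102, 101, 102),
  WIDTH          = c(30, 45, 20, NA, NA),
  HEIGHT         = c(60, 72, 48, NA, NA)   # HEIGHT is an assumed column name
)

# For every row, the position of the FIRST row whose SALES_ORDER equals this
# row's ORIGINAL_ORDER (NA if the original order is not in the data).
idx <- match(hd$ORIGINAL_ORDER, hd$SALES_ORDER)

# Fill each column in one vectorized assignment instead of a row-by-row loop.
for (col in c("WIDTH", "HEIGHT")) {
  missing <- is.na(hd[[col]])
  hd[[col]][missing] <- hd[[col]][idx[missing]]
}

print(hd)
```

On the toy data the two remake rows end up with WIDTH 30 and 20 and HEIGHT 60 and 48. Because match() computes the lookup once for the whole column, it avoids rescanning SALES_ORDER for every NA row, which is what makes the loop in the question slow. Rows whose original order is absent from the data simply stay NA.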

