Imputing NA's with millions of rows of data in R

Question

I have an orders dataset that contains sales order and sales order line information. Below is a screenshot of the first few columns of data:

Each sales order number is unique, but a sales order can have multiple sales order lines. 20% of the data is what we call remakes, which we identify because the sales order number won't match the Original Order number column. We are trying to build a model to predict whether an order will be returned or not. Unfortunately, there are 3 columns (width, height and fabric number; not shown) that have NAs for the sales orders that were remakes. I'm trying to impute those NAs with the values from the original order.

This is the code I have:

for (i in 1:length(hd$SALES_ORDER)){
  if (is.na(hd$WIDTH[i]) == TRUE){
    hd$WIDTH[i] = hd$WIDTH[hd$ORIGINAL_ORDER[i] == hd$SALES_ORDER][1]
  }
}

The [1] takes the first value returned, since there could be multiple sales lines. I attempted to match on sales order line and original order line as well, but kept getting a 'value length' error.

My data has 3 million+ rows and 400k NAs. The for loop is running, but it has been running for an hour. Is there a more efficient way to accomplish my task?

Thanks

Answer 1

This seems unusually slow. Even without any optimization (e.g. using data.table), the approach below takes only a couple of seconds on a 2M-row data frame: it fills in the NAs in 1 million rows from the preceding order with the same ORIGINAL_ORDER.

library(dplyr); library(tidyr)
my_data_million <- data.frame(stringsAsFactors = FALSE, # not necessary for R >= 4.0.0
                      ORIGINAL_ORDER = rep(1:1000000, 2),
                      SALES_ORDER = 1000000:2999999,
                      WIDTH = c(sample(1:50, 1000000, replace = TRUE), rep(NA, 1000000))
) %>%
  slice_sample(n = 2E6, replace = FALSE)   # Shuffling just to show it's still fast


my_data_million %>%
  arrange(ORIGINAL_ORDER, SALES_ORDER) %>%
  group_by(ORIGINAL_ORDER) %>%
  tidyr::fill(WIDTH, .direction = "updown") %>%    # fill NAs down, then up, within each group
  ungroup()
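
If you'd rather keep the exact lookup your loop does (take the first row whose SALES_ORDER equals this row's ORIGINAL_ORDER), a vectorized base R version with match() might look like the sketch below. This is not the method above, just an alternative under some assumptions: the data frame is called hd as in the question, HEIGHT is a hypothetical name for one of the other NA columns, and the toy values are made up for illustration. Note that, like the loop, it leaves the NA in place when the first matching row is itself NA or when no original order is found.

```r
# Toy data standing in for hd; HEIGHT is a hypothetical column name.
hd <- data.frame(
  SALES_ORDER    = c(100, 100, 101, 200, 201),
  ORIGINAL_ORDER = c(100, 100, 101, 100, 101),
  WIDTH          = c(30, 45, 22, NA, NA),
  HEIGHT         = c(60, 72, 48, NA, NA)
)

# For every row, the position of the FIRST row whose SALES_ORDER equals
# this row's ORIGINAL_ORDER (NA if there is no such row).
idx <- match(hd$ORIGINAL_ORDER, hd$SALES_ORDER)

# Fill each column's NAs from the matched original-order row.
for (col in c("WIDTH", "HEIGHT")) {
  is_na <- is.na(hd[[col]])
  hd[[col]][is_na] <- hd[[col]][idx[is_na]]
}

print(hd)
```

match() computes the lookup once for the whole column instead of scanning SALES_ORDER on every iteration, which is where the loop spends its time.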

Answer 1 (accepted, 2021-02-18 00:53:57)