简体   繁体   中英

Speedy replacement of NA by strings using Data.Table or Rcpp in R

I have a big table: 10M rows by 33 columns, of which 28 columns have some NA values. These NA values need to be patched using locf() . I read a few threads ( efficiently locf by groups in a single R data.table and na.locf and inverse.rle in Rcpp ) on this topic. However, these threads are about replacing numeric vectors. I am not too familiar with Rcpp so I don't know how to change their code to cater to strings---my data are all strings.

Here are my sample data:

Input Data

Sample_File = structure(list(SO = c(112, 112, 112, 112, 113, 113, 113, 113), 
    Product.ID = c("AB123", "CD234", "DE345", "EF456", "FG456", 
    "GH567", "HI678", "IJ789"), Name = c(NA, NA, NA, "Human Being", 
    NA, "Lion", NA, "Bird"), Family = c(NA, NA, NA, "Homo Sapiens", 
    NA, NA, NA, "Passeridae"), SL1_Continent = c("Asia", NA, 
    "Asia", "Asia", NA, NA, NA, "Australia"), SL2_Country = c("China", 
    "China", NA, NA, NA, NA, NA, "Australia"), SL3_Direction = c("East", 
    NA, "East", "East", NA, NA, NA, "West"), Expiration_FY = c(2021, 
    NA, 2018, NA, 2012, 2012, NA, 2012), Flag = c("Y", NA, "N", 
    "N", NA, NA, NA, "TBD"), Insured = c("No", NA, NA, NA, NA, 
    NA, NA, "Yes"), Revenue = c(0, 478227.44, 0, 0, 0, 0, 125550.4, 
    44314.51), Quantity = c(1000, 100, 100, 4, 6, 6, 4, 6)), .Names = c("SO", 
"Product.ID", "Name", "Family", "SL1_Continent", "SL2_Country", 
"SL3_Direction", "Expiration_FY", "Flag", "Insured", "Revenue", 
"Quantity"), row.names = c(NA, 8L), class = "data.frame")

Here's my code using data.table :

data.table::setDT(Sample_File)
cols <- c("Name","Family","SL1_Continent","SL2_Country","SL3_Direction","Expiration_FY","Flag","Insured")
Sample_File[, (cols):=lapply(.SD, function(x){na.locf(x,fromLast = TRUE,na.rm=TRUE)}), by = SO, .SDcols = cols]

Expected Output:

Output = structure(list(SO = c(112, 112, 112, 112, 113, 113, 113, 113), 
    Product.ID = c("AB123", "CD234", "DE345", "EF456", "FG456", 
    "GH567", "HI678", "IJ789"), Name = c("Human Being", "Human Being", 
    "Human Being", "Human Being", "Lion", "Lion", "Bird", "Bird"
    ), Family = c("Homo Sapiens", "Homo Sapiens", "Homo Sapiens", 
    "Homo Sapiens", "Passeridae", "Passeridae", "Passeridae", 
    "Passeridae"), SL1_Continent = c("Asia", "Asia", "Asia", 
    "Asia", "Australia", "Australia", "Australia", "Australia"
    ), SL2_Country = c("China", "China", "China", "China", "Australia", 
    "Australia", "Australia", "Australia"), SL3_Direction = c("East", 
    "East", "East", "East", "West", "West", "West", "West"), 
    Expiration_FY = c(2021, 2018, 2018, 2021, 2012, 2012, 2012, 
    2012), Flag = c("Y", "N", "N", "N", "TBD", "TBD", "TBD", 
    "TBD"), Insured = c("No", "No", "No", "No", "Yes", "Yes", 
    "Yes", "Yes"), Revenue = c(0, 478227.44, 0, 0, 0, 0, 125550.4, 
    44314.51), Quantity = c(1000, 100, 100, 4, 6, 6, 4, 6)), .Names = c("SO", 
"Product.ID", "Name", "Family", "SL1_Continent", "SL2_Country", 
"SL3_Direction", "Expiration_FY", "Flag", "Insured", "Revenue", 
"Quantity"), row.names = c(NA, -8L), class = "data.frame")

While the above code takes fraction of second to execute, it takes ~10 minutes to process one column in my original data-set, which translates to ~280 minutes to process 28 columns even with data.table .

I am assuming that I am not really utilizing the power of data.table above. I am not really sure. I'd sincerely appreciate any help to speed up na.locf() function.

Is there any more efficient method to replace NA above?

I simplified the problem for the purpose of this example, but I guess it is easy enough to generalize. The code below defines the locppf function in Rcpp using C++11 syntax:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]

using Map = std::unordered_map<double, int> ;
using Pair = Map::value_type ;

// [[Rcpp::export]]
CharacterVector locppf(NumericVector g, CharacterVector s) {
  auto n = g.size() ;
  CharacterVector out = clone(s) ;

  Map map ;
  for(int i=n-1; i>=0; i--){
    double value = g[i] ;
    auto it = map.find( value ) ;

    if( it == map.end() ){
      map.insert( Pair(value, i) ) ;
    } else {
      // if the current value is NA, replace it with the data at correct idx  
      auto current = s[i] ;
      if( CharacterVector::is_na( current ) ){
        out[i] = s[ it->second ] ;
      } else {
        it->second = i ;
      } 
    }
  }
  return out ;
}

The idea is to define a map to track the index of the last time we've seen something that was not NA in the group.I'm using std::unordered_map<double, int> as the map because your example also used a numeric vector for the group.

Let's break the relevant nuggets:

if( it == map.end() ){
  map.insert( Pair(value, i) ) ;
} 

Here we check if the map has already seen the current value, and if not we retain the current index.

      auto current = s[i] ;
      if( CharacterVector::is_na( current ) ){
        out[i] = s[ it->second ] ;
      } else {
        it->second = i ;
      } 

Here we check if the current value is NA with CharacterVector::is_na .

If it is we fill the result vector with the value that is in the index we retained earlier.

If not, we change the index that is remembered by the map for this group.

Now let's give ourselves some data:

library("zoo")
library("dplyr")
library("data.table")

with_holes <- function(x, p = .2){
  n <- length(x)
  x[ sample(n, n*p) ] <- NA
  x
}

n <- 1e6
x <- sample( as.numeric(1:100), n, replace= TRUE )
y <- with_holes( sample( letters, n, replace = TRUE) )
d <- data_frame( x = x, y = y )

And measure timing with various options:

Using dplyr syntax with group_by , mutate and na.locf

> system.time( d %>% group_by(x) %>% mutate( y = na.locf(y, fromLast = TRUE, na.rm = FALSE) ) )
  user  system elapsed 
  0.173   0.023   0.198 

Using data.table syntax with na.locf . I don't guarantee this is the best data.table way to do this though.

> d2 <- as.data.table(d)
> system.time( d2[ , y := na.locf(y, fromLast = TRUE, na.rm = FALSE) , x ]  )
  user  system elapsed 
  0.159   0.030   0.188 

With out custom locppf function:

> system.time( locppf(d$x, d$y) )
  user  system elapsed 
  0.028   0.001   0.028   

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM