R convert dataframe with character lists to list with correct R types efficiently

I use crateDB to load a table into R as a dataframe. The problem is that crateDB sends arrays as comma-separated strings, so I want to convert all arrays to the correct R type. I also want to convert the dataframe to a list, since it's possible to use objects in crateDB, which wouldn't work with a dataframe. This conversion is too slow at the moment, so I tried several things to improve the performance.

If I have the following dataframe:

df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  position = c(
    "{\"82.81864\",\"82.586235\",\"82.35383\"}",
    "{\"83.81864\",\"83.586235\",\"83.35383\"}",
    "{\"84.81864\",\"84.586235\",\"84.35383\"}",
    "{\"85.81864\",\"85.586235\",\"85.35383\"}",
    "{\"86.81864\",\"86.586235\",\"86.35383\"}",
    "{\"87.81864\",\"87.586235\",\"87.35383\"}",
    "{\"88.81864\",\"88.586235\",\"88.35383\"}",
    "{\"89.81864\",\"89.586235\",\"89.35383\"}",
    "{\"90.81864\",\"90.586235\",\"90.35383\"}",
    "{\"91.81864\",\"91.586235\",\"91.35383\"}"
  ),
  vcontrol = c(
    "{\"t\",\"t\",\"t\",\"t\"}","{\"f\",\"f\",\"f\",\"t\"}",
    "{\"f\",\"t\",\"f\",\"t\"}", "{\"t\",\"t\",\"f\",\"t\"}",
    "{\"t\",\"t\",\"f\",\"t\"}", "{\"t\",\"f\",\"f\",\"t\"}",
    "{\"t\",\"f\",\"f\",\"t\"}", "{\"t\",\"t\",\"f\",\"t\"}",
    "{\"t\",\"t\",\"f\",\"t\"}", "{\"t\",\"t\",\"f\",\"f\"}"
  )
)

The resulting list after conversion should look like this: [image: converted list]

I started with two for loops, which were really slow for big datasets. Then I tried the apply functions:

convertDF = function(dataFrame, dataTypes){
  dimension <- dim(x = dataFrame)
  names <- names(x = dataFrame)
  
  # Turn every column into a list of single cells
  asList <- lapply(dataFrame, as.list)
  
  # Convert column by column; dataTypes holds one database type per column
  for(col in seq_along(asList)){
    asList[[col]] <- lapply(asList[[col]], convertToRType, type = dataTypes[col])
  }
  
  # Regroup the converted cells row by row into named lists
  data <- list()
  for(datarow in seq_len(dimension[1])){
    tempData <- list()
    for(datacol in seq_len(dimension[2])){
      tempData[[names[datacol]]] <- asList[[datacol]][[datarow]]
    }
    data[[datarow]] <- tempData
  }
  return(data)
}
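The nested loops at the end of convertDF regroup the converted cells row by row. As a rough sketch of an alternative, base R's Map() can do the same regrouping in one call. The helper name transposeColumns is hypothetical, and the sketch assumes every column has already been turned into a list of equal length, as asList is above.

```r
# Hypothetical helper: turn a named list of columns into a list of named rows.
# Assumes each element of `columns` is itself a list with one entry per row.
transposeColumns <- function(columns) {
  # Map() walks all columns in parallel and calls list() once per row;
  # the column names are passed on as the argument names of list().
  do.call(Map, c(f = list, columns))
}

# Small illustration with made-up values
cols <- list(
  id = list(1, 2),
  vcontrol = list(c(TRUE, FALSE), TRUE)
)
rows <- transposeColumns(cols)
str(rows)
```

Growing a list inside a for loop copies data repeatedly, so letting Map() allocate the result once may help on large tables, though this would need to be measured on real data.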

The convertToRType function takes the type used in the database as a parameter, so it can tell whether to convert a value to an integer, double or logical. I do this with if (identical(...)) checks. For arrays, I first remove all unneeded characters, split at "," and then, for example, apply as.double to the entire vector.

  data <- str_replace_all(
    string = rawData,
    pattern = c("\\{" = "", "\\}" = "", "\"" = "")
  )
  data <- str_split(string = data, pattern = ",")[[1]]
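The full convertToRType function is not shown in the question. A minimal sketch of what it might look like, written in base R with switch() instead of the identical checks mentioned above, is given here. The type labels ("integer", "double", "double_array", "boolean_array") and the helper parseArray are assumptions made for illustration, not names taken from crateDB.

```r
# Hypothetical helper: strip braces and quotes, then split on commas
parseArray <- function(rawData) {
  cleaned <- gsub("[{}\"]", "", rawData)
  strsplit(cleaned, ",", fixed = TRUE)[[1]]
}

# Hypothetical sketch of convertToRType(); the type labels are assumed
convertToRType <- function(rawData, type) {
  switch(type,
    integer       = as.integer(rawData),
    double        = as.double(rawData),
    double_array  = as.double(parseArray(rawData)),
    # as.logical() does not understand "t"/"f", so compare against "t"
    boolean_array = parseArray(rawData) == "t",
    rawData  # unknown types are returned unchanged
  )
}

positions <- convertToRType("{\"82.81864\",\"82.586235\",\"82.35383\"}", "double_array")
flags <- convertToRType("{\"t\",\"f\",\"f\",\"t\"}", "boolean_array")
```

Because gsub() and strsplit() are vectorised, the same cleaning step could also be applied to a whole column at once rather than cell by cell.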

I did this because I wanted to use the multithreading capabilities of lapply, but it turns out that is even slower on Windows. With this function, though, the conversion took only half as long as before. I still don't like this solution: the code isn't clean and the performance doesn't seem good.

Can anybody tell me how to do this conversion as efficiently as possible? I'm running out of ideas.

Here's a solution that uses dplyr to convert the columns to the right types inside the dataframe, and then purrr to transpose the result to a list and simplify it. Watch out for "gotchas" such as "t" not being read as TRUE by as.logical().

library(dplyr)
library(stringr)
library(purrr)
x <- 
   df %>% 
   mutate(position = str_replace_all(
      string = .$position,
      pattern = c("\\{" = "", "\\}" = "", "\"" = "")
   ) %>% str_split(string = ., pattern = ",")
   ) %>%
   mutate(vcontrol = str_replace_all(
      string = .$vcontrol,
      pattern = c("\\{" = "", "\\}" = "", "\"" = "")
   ) %>%
      str_replace_all(string = ., c("t" = "TRUE", 
                                             "f" = "FALSE")) %>%
      str_split(string = ., pattern = ",")) %>%
   rowwise() %>%
   mutate(position = list(as.numeric(unlist(position)))) %>%
   mutate(vcontrol = list(as.logical(unlist(vcontrol))))

converted_df <- transpose(x) %>% simplify_all()
str(converted_df)
#> List of 10
#>  $ :List of 3
#>   ..$ id      : num 1
#>   ..$ position: num [1:3] 82.8 82.6 82.4
#>   ..$ vcontrol: logi [1:4] TRUE TRUE TRUE TRUE
#>  $ :List of 3
#>   ..$ id      : num 2
#>   ..$ position: num [1:3] 83.8 83.6 83.4
#>   ..$ vcontrol: logi [1:4] FALSE FALSE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 3
#>   ..$ position: num [1:3] 84.8 84.6 84.4
#>   ..$ vcontrol: logi [1:4] FALSE TRUE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 4
#>   ..$ position: num [1:3] 85.8 85.6 85.4
#>   ..$ vcontrol: logi [1:4] TRUE TRUE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 5
#>   ..$ position: num [1:3] 86.8 86.6 86.4
#>   ..$ vcontrol: logi [1:4] TRUE TRUE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 6
#>   ..$ position: num [1:3] 87.8 87.6 87.4
#>   ..$ vcontrol: logi [1:4] TRUE FALSE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 7
#>   ..$ position: num [1:3] 88.8 88.6 88.4
#>   ..$ vcontrol: logi [1:4] TRUE FALSE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 8
#>   ..$ position: num [1:3] 89.8 89.6 89.4
#>   ..$ vcontrol: logi [1:4] TRUE TRUE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 9
#>   ..$ position: num [1:3] 90.8 90.6 90.4
#>   ..$ vcontrol: logi [1:4] TRUE TRUE FALSE TRUE
#>  $ :List of 3
#>   ..$ id      : num 10
#>   ..$ position: num [1:3] 91.8 91.6 91.4
#>   ..$ vcontrol: logi [1:4] TRUE TRUE FALSE FALSE
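
The "t" gotcha mentioned above is easy to reproduce in isolation. The short sketch below uses a made-up vector of flags to show why the answer rewrites "t" and "f" before calling as.logical(), along with a simpler comparison-based alternative; the variable names are illustrative only.

```r
raw_flags <- c("t", "f", "f", "t")

# as.logical() only accepts "T", "TRUE", "true", "True" and the
# matching spellings of FALSE, so lowercase "t"/"f" become NA
naive <- as.logical(raw_flags)

# Comparing against "t" avoids the string rewrite altogether
fixed <- raw_flags == "t"

print(naive)
print(fixed)
```

The comparison treats anything other than "t" as FALSE, including missing or malformed values, so the explicit rewrite may be preferable when the input can contain NULLs.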
