I use CrateDB to load a table into R as a dataframe. The problem is that CrateDB sends arrays as comma-separated strings, so I want to convert every array to the correct R type. I also want to convert the dataframe to a list, because CrateDB allows objects, which wouldn't work in a dataframe. This conversion is currently too slow, so I tried several things to improve its performance.
If I have the following dataframe:
df <- data.frame(
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
position = c(
"{\"82.81864\",\"82.586235\",\"82.35383\"}",
"{\"83.81864\",\"83.586235\",\"83.35383\"}",
"{\"84.81864\",\"84.586235\",\"84.35383\"}",
"{\"85.81864\",\"85.586235\",\"85.35383\"}",
"{\"86.81864\",\"86.586235\",\"86.35383\"}",
"{\"87.81864\",\"87.586235\",\"87.35383\"}",
"{\"88.81864\",\"88.586235\",\"88.35383\"}",
"{\"89.81864\",\"89.586235\",\"89.35383\"}",
"{\"90.81864\",\"90.586235\",\"90.35383\"}",
"{\"91.81864\",\"91.586235\",\"91.35383\"}"
),
vcontrol = c(
"{\"t\",\"t\",\"t\",\"t\"}","{\"f\",\"f\",\"f\",\"t\"}",
"{\"f\",\"t\",\"f\",\"t\"}", "{\"t\",\"t\",\"f\",\"t\"}",
"{\"t\",\"t\",\"f\",\"t\"}", "{\"t\",\"f\",\"f\",\"t\"}",
"{\"t\",\"f\",\"f\",\"t\"}", "{\"t\",\"t\",\"f\",\"t\"}",
"{\"t\",\"t\",\"f\",\"t\"}", "{\"t\",\"t\",\"f\",\"f\"}"
)
)
The resulting list after conversion should look like this:
I started with two for loops, which were really slow for big datasets. Then I tried the apply functions:
convertDF <- function(dataFrame, dataTypes) {
  dimension <- dim(x = dataFrame)
  names <- names(x = dataFrame)
  asList <- lapply(dataFrame, as.list)
  # Convert each column according to its database type
  for (col in seq_along(asList)) {
    asList[[col]] <- lapply(asList[[col]], convertToRType, type = dataTypes[col])
  }
  # Rebuild the data row by row as a list of named lists
  data <- list()
  for (datarow in seq_len(dimension[1])) {
    tempData <- list()
    for (datacol in seq_len(dimension[2])) {
      tempData[[names[datacol]]] <- asList[[datacol]][[datarow]]
    }
    data[[datarow]] <- tempData
  }
  return(data)
}
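The nested loops at the end of convertDF turn a list of columns into a list of rows. As a minimal base R sketch of that step alone, assuming the columns have already been converted to lists of equal length (the toy values and the names `cols` and `rows` are made up for illustration), `Map()` can do the transposition:

```r
# Toy column lists standing in for the already converted columns.
cols <- list(
  id = list(1, 2),
  position = list(c(82.8, 82.6), c(83.8, 83.6))
)

# Map() walks all columns in parallel and calls list() once per row,
# so every element of the result is a named list holding one row.
# Caveat: a column literally named "f" would clash with Map's own argument.
rows <- unname(do.call(Map, c(list(f = list), cols)))

str(rows[[1]])
```

Whether this is actually faster than the explicit loops depends on the data, so it is worth benchmarking on a realistic table.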
The convertToRType function takes the database type as a parameter, so it can tell whether to convert a value to an integer, double or logical. I do this with if/identical checks. For arrays, I first remove all unneeded characters, split at the commas, and then apply, for example, as.double to the entire vector.
data <- str_replace_all(
string = rawData,
pattern = c("\\{" = "", "\\}" = "", "\"" = "")
)
data <- str_split(string = data, pattern = ",")[[1]]
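For illustration, the per-value conversion described above could look roughly like the base R sketch below. The function name `parse_array` and the type labels "double", "integer" and "boolean" are assumptions made for this example; they are not taken from CrateDB or from the original convertToRType.

```r
# Hypothetical helper: parse one array string such as {"1.5","2.5"}.
# The type labels are assumed names, not CrateDB's own.
parse_array <- function(raw, type) {
  # Drop braces and double quotes in a single regex pass.
  cleaned <- gsub("[{}\"]", "", raw)
  parts <- strsplit(cleaned, ",", fixed = TRUE)[[1]]
  switch(type,
    double  = as.double(parts),
    integer = as.integer(parts),
    boolean = parts == "t",  # "t"/"f" flags become TRUE/FALSE
    parts                    # unknown types stay character
  )
}

pos <- parse_array("{\"82.81864\",\"82.586235\",\"82.35383\"}", "double")
flags <- parse_array("{\"f\",\"t\",\"f\",\"t\"}", "boolean")
```

Using `switch()` on the type keeps the dispatch in one place instead of a chain of identical() checks.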
I did this because I wanted to use the multithreading capabilities of lapply, but it turns out that it's even slower on Windows. With this function the conversion took only half as long as before. I still don't like this solution, though: the code isn't clean and it doesn't seem to perform well.
Can anybody tell me how to do this conversion as efficiently as possible? I'm running out of ideas.
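Here's a solution using dplyr to do the conversions to the right types inside the dataframe, and then purrr to transpose it to a list and simplify. Watch out for "gotchas" such as "t" not being read as TRUE: as.logical("t") returns NA, so the flags are rewritten to "TRUE" and "FALSE" before conversion.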
library(dplyr)
library(stringr)
library(purrr)
x <-
df %>%
mutate(position = str_replace_all(
string = .$position,
pattern = c("\\{" = "", "\\}" = "", "\"" = "")
) %>% str_split(string = ., pattern = ",")
) %>%
mutate(vcontrol = str_replace_all(
string = .$vcontrol,
pattern = c("\\{" = "", "\\}" = "", "\"" = "")
) %>%
str_replace_all(string = ., c("t" = "TRUE",
"f" = "FALSE")) %>%
str_split(string = ., pattern = ",")) %>%
rowwise() %>%
mutate(position = list(as.numeric(unlist(position)))) %>%
mutate(vcontrol = list(as.logical(unlist(vcontrol))))
converted_df <- transpose(x) %>% simplify_all()
str(converted_df)
#> List of 10
#> $ :List of 3
#> ..$ id : num 1
#> ..$ position: num [1:3] 82.8 82.6 82.4
#> ..$ vcontrol: logi [1:4] TRUE TRUE TRUE TRUE
#> $ :List of 3
#> ..$ id : num 2
#> ..$ position: num [1:3] 83.8 83.6 83.4
#> ..$ vcontrol: logi [1:4] FALSE FALSE FALSE TRUE
#> $ :List of 3
#> ..$ id : num 3
#> ..$ position: num [1:3] 84.8 84.6 84.4
#> ..$ vcontrol: logi [1:4] FALSE TRUE FALSE TRUE
#> $ :List of 3
#> ..$ id : num 4
#> ..$ position: num [1:3] 85.8 85.6 85.4
#> ..$ vcontrol: logi [1:4] TRUE TRUE FALSE TRUE
#> $ :List of 3
#> ..$ id : num 5
#> ..$ position: num [1:3] 86.8 86.6 86.4
#> ..$ vcontrol: logi [1:4] TRUE TRUE FALSE TRUE
#> $ :List of 3
#> ..$ id : num 6
#> ..$ position: num [1:3] 87.8 87.6 87.4
#> ..$ vcontrol: logi [1:4] TRUE FALSE FALSE TRUE
#> $ :List of 3
#> ..$ id : num 7
#> ..$ position: num [1:3] 88.8 88.6 88.4
#> ..$ vcontrol: logi [1:4] TRUE FALSE FALSE TRUE
#> $ :List of 3
#> ..$ id : num 8
#> ..$ position: num [1:3] 89.8 89.6 89.4
#> ..$ vcontrol: logi [1:4] TRUE TRUE FALSE TRUE
#> $ :List of 3
#> ..$ id : num 9
#> ..$ position: num [1:3] 90.8 90.6 90.4
#> ..$ vcontrol: logi [1:4] TRUE TRUE FALSE TRUE
#> $ :List of 3
#> ..$ id : num 10
#> ..$ position: num [1:3] 91.8 91.6 91.4
#> ..$ vcontrol: logi [1:4] TRUE TRUE FALSE FALSE
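The same column-wise idea can also be sketched in base R without any packages. This is only an illustration under assumptions: the helper name `split_arrays` is made up, the two-row data frame is a shortened copy of the example data, and it has not been benchmarked against the dplyr version.

```r
# Two rows in the same shape as the example data.
df <- data.frame(
  id = c(1, 2),
  position = c("{\"82.81864\",\"82.586235\"}", "{\"83.81864\",\"83.586235\"}"),
  vcontrol = c("{\"t\",\"f\"}", "{\"f\",\"f\"}"),
  stringsAsFactors = FALSE
)

# Clean and split a whole column at once: one character vector per row.
split_arrays <- function(x) {
  strsplit(gsub("[{}\"]", "", x), ",", fixed = TRUE)
}

position <- lapply(split_arrays(df$position), as.numeric)
vcontrol <- lapply(split_arrays(df$vcontrol), function(v) v == "t")

# Assemble one named list per row from the converted columns.
rows <- lapply(seq_len(nrow(df)), function(i) {
  list(id = df$id[i], position = position[[i]], vcontrol = vcontrol[[i]])
})

str(rows[[1]])
```

Comparing against "t" directly avoids the as.logical("t") gotcha, and the string cleanup runs once per column rather than once per cell.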