I read data files from a directory where I don't know the number or the names of the files in advance. Each file is a data frame (stored as a parquet file). I can read the files, but how should I name the results?
I would like to have something like a named list where the filename is the name of the element. I don't know how to do this in R. In Python I would use a dictionary like this:
import pandas as pd

file_names = ['A.parquet', 'B.parquet']
all_data = {}
for fn in file_names:
    data = pd.read_parquet(fn)
    all_data[fn] = data
How can I solve this in R?
library("arrow")
file_names = c('a.parquet', 'B.parquet')
# "named vector"?
daten = c()
for (pf in file_names) {
  # name of data frame (filename without suffix)
  df_name <- strsplit(pf, ".", fixed = TRUE)[[1]][1]
  df <- arrow::read_parquet(pf)
  daten[df_name] = df
}
This doesn't work. I get this message:
number of items to replace is not a multiple of replacement length
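Each arrow::read_parquet() call returns a data frame. A data frame is itself a list of columns, so the single-bracket assignment daten[df_name] = df tries to squeeze all of its columns into one slot, which is what triggers the message. Store the results of your loop in a list of data frames instead. In particular, you are looking for a named list.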
file_names <- c('a.parquet', 'B.parquet')
## loop through files (can be replaced by a one-line `lapply` call)
daten <- list() ## not c()
for (i in seq_along(file_names)) {
  daten[[i]] <- arrow::read_parquet(file_names[i])
}
## grab filename without suffix (escape the dot and anchor at the end)
names(daten) <- sub("\\.parquet$", "", file_names)
To access a list element by name, use daten[["a"]] and daten[["B"]].
Remark: Since the length of the list is known, it is better to initialize it with a fixed length, so that the list does not grow in size during the loop.
daten <- vector("list", length(file_names))
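As a rough sketch of how the preallocated list fits into the loop: the snippet below uses a hypothetical stand-in reader, fake_read(), so that it runs without any parquet files on disk. In real code, arrow::read_parquet would take its place.

```r
# Hypothetical stand-in for arrow::read_parquet(): returns a tiny
# data frame describing the path instead of reading a file.
fake_read <- function(path) data.frame(source = path, n_chars = nchar(path))

file_names <- c("a.parquet", "B.parquet")

# Preallocate one (NULL) slot per file, then fill the slots in place.
daten <- vector("list", length(file_names))
for (i in seq_along(file_names)) {
  daten[[i]] <- fake_read(file_names[i])
}

# Name each element after its file, without the extension.
names(daten) <- tools::file_path_sans_ext(file_names)

names(daten)
daten[["B"]]
```

tools::file_path_sans_ext() is used here only as one convenient way to drop the extension; the tools package ships with R.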
In addition, if you know the lapply function, you can replace the loop with the following, so that you don't even need to bother with list initialization.
daten <- lapply(file_names, arrow::read_parquet)
As a result, the code can be shortened to:
daten <- lapply(file_names, arrow::read_parquet)
names(daten) <- sub("\\.parquet$", "", file_names)
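Since the question mentions that neither the number nor the names of the files are known in advance, here is a minimal sketch of how the file names could be discovered with list.files(). The helper read_all() and the demo directory are hypothetical, and the demo reads small CSV stand-ins with read.csv so that it runs without the arrow package. For parquet files you would presumably pass the pattern "\\.parquet$" and arrow::read_parquet instead.

```r
# Hypothetical helper: read every file in `dir` whose name matches `pattern`
# and return a list named after the files (directory and extension removed).
read_all <- function(dir, pattern, reader) {
  paths <- list.files(dir, pattern = pattern, full.names = TRUE)
  daten <- lapply(paths, reader)
  names(daten) <- tools::file_path_sans_ext(basename(paths))
  daten
}

# Demo with CSV stand-ins in a temporary directory.
demo_dir <- file.path(tempdir(), "named_list_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(y = 4:5), file.path(demo_dir, "B.csv"), row.names = FALSE)

daten <- read_all(demo_dir, "\\.csv$", read.csv)
names(daten)   # one name per file found; the order depends on the locale
daten[["a"]]
```

basename() strips the directory part that full.names = TRUE adds, so the list names stay short even when the files live in another directory.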
In the tidyverse you would use purrr. This is basically the same as the lapply() or sapply() approach, but in a different ecosystem.
library(arrow)
library(purrr)
file_names = c('a.parquet', 'B.parquet')
daten <- file_names %>%
  set_names(tools::file_path_sans_ext) %>%
  map(read_parquet)
You would access each list item through the usual ways.
daten$a
daten$B
# or
daten[["a"]]
daten[["B"]]
Explanation
The pipe operator %>% is very common in R code these days. It comes from the magrittr package, but is also re-exported by various other tidyverse packages, including purrr.
The pipe takes its left-hand side and passes it as the first argument to the expression on its right-hand side. So f(x, y) can be written as x %>% f(y). This is useful for chaining expressions together. R itself has a native pipe operator, |>, starting with version 4.1.0.
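To make the rewriting rule concrete, here is a small sketch using the native pipe, so it needs R 4.1.0 or later but no extra packages. The vector x and the two variable names are made up for illustration.

```r
# Requires R >= 4.1.0 for the native pipe |>.
x <- c(4, 9, 16)

# Nested call: read from the inside out.
nested <- round(sqrt(x), 1)

# Piped call: each step receives the previous result as its first argument.
piped <- x |> sqrt() |> round(1)

identical(nested, piped)
```

The same chain written with %>% should behave identically once magrittr or purrr is loaded.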
file_names is an unnamed character vector of the file names. set_names() turns it into a named vector by applying the function tools::file_path_sans_ext() to file_names. This removes the file extension, so each element is named after its file name without the extension.
map() then iterates over each element of the vector and returns a list that carries the names of the vector elements. Each iteration runs read_parquet() on its input (the file name).
You can use named lists like so.
You can either use the file names directly as element names
sapply(file_names, arrow::read_parquet, USE.NAMES = TRUE, simplify = FALSE)
or set them afterwards using whatever function you like (here the lookahead keeps the trailing dot out of the name)
setNames(lapply(file_names, arrow::read_parquet), stringr::str_extract(file_names, '^.+(?=\\.)'))
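The practical difference between the two options is what the element names look like. The sketch below illustrates this with a hypothetical stand-in reader, fake_read(), so that it runs without parquet files, and uses tools::file_path_sans_ext() as one possible naming function.

```r
# Hypothetical stand-in for arrow::read_parquet().
fake_read <- function(path) data.frame(source = path)

file_names <- c("a.parquet", "B.parquet")

# Option 1: sapply() names the result after the character input itself.
by_full_name <- sapply(file_names, fake_read, USE.NAMES = TRUE, simplify = FALSE)
names(by_full_name)   # "a.parquet" "B.parquet"

# Option 2: set the names afterwards with any function of the file names.
by_stem <- setNames(lapply(file_names, fake_read),
                    tools::file_path_sans_ext(file_names))
names(by_stem)        # "a" "B"
```

With simplify = FALSE, sapply() returns a plain list, so both results are accessed the same way, for example by_full_name[["a.parquet"]] or by_stem[["a"]].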