
Name a variable or object based on the value of another variable in R

I read data files from a directory where I don't know the number or the names of the files in advance. Each file contains one data frame (stored as a Parquet file). I can read the files, but how should I name the results?

I would like to have something like a named list where the file name is the name of the element. I don't know how to do this in R. In Python I would use a dictionary like this:

import pandas as pd

file_names = ['A.parquet', 'B.parquet']

all_data = {}

for fn in file_names:
    data = pd.read_parquet(fn)
    all_data[fn] = data

How can I solve this in R?

library("arrow")

file_names = c('a.parquet', 'B.parquet')

# "named vector"?
daten = c()

for (pf in file_names) {
    # name of data frame (filename without suffix)
    df_name <- strsplit(pf, ".", fixed=TRUE)[[1]][1]

    df <- arrow::read_parquet(pf)

    daten[df_name] = df
}

This doesn't work; I get this error:

number of items to replace is not a multiple of replacement length

Each arrow::read_parquet() call returns a data frame. You want to store the results of your loop using a list of data frames. In particular, you are looking for a named list.

file_names <- c('a.parquet', 'B.parquet')

## loop through files (can be replaced by a one-line `lapply` call)
daten <- list()  ## not c()
for (i in seq_along(file_names)) {  ## seq_along() is safe even if there are no files
  daten[[i]] <- arrow::read_parquet(file_names[i])
}

## grab filename without suffix
names(daten) <- sub("\\.parquet$", "", file_names)

To access a list element by name, use daten[["a"]] and daten[["B"]].
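The original attempt most likely fails because c() starts an atomic vector, and daten[df_name] = df then tries to fit a multi-column data frame into a single slot. A list has no such restriction. Here is a minimal sketch of the list behaviour, using two small made-up data frames (df_a and df_b are hypothetical stand-ins for what arrow::read_parquet() returns):

```r
# Hypothetical stand-ins for the data frames returned by arrow::read_parquet()
df_a <- data.frame(x = 1:3, y = c("p", "q", "r"))
df_b <- data.frame(x = 4:5, y = c("s", "t"))

# A list can hold a whole data frame as a single element
daten <- list()
daten[["a"]] <- df_a
daten[["B"]] <- df_b

names(daten)                  # the element names
nrow(daten[["a"]])            # rows of the data frame stored under "a"
is.data.frame(daten[["B"]])   # each element is still a data frame
```

Note the double brackets: daten[["a"]] returns the data frame itself, while daten["a"] returns a list of length one that contains it.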


Remark: Since the length of the list is known, it is better to initialize it with a fixed length, so that the list does not grow in size during the loop.

daten <- vector("list", length(file_names))

In addition, if you are familiar with the lapply() function, you can replace the loop with the following, so that you don't need to initialize the list at all.

daten <- lapply(file_names, arrow::read_parquet)

As a result, the code can be shortened to:

daten <- lapply(file_names, arrow::read_parquet)
names(daten) <- sub("\\.parquet$", "", file_names)
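Since the question mentions that the file names are not known in advance, the same idea can be combined with list.files(). The following is only a sketch: the directory name named_list_demo is made up, and CSV files with read.csv() are used as a stand-in so that it runs without arrow installed. For Parquet files you would presumably swap in arrow::read_parquet and the pattern "\\.parquet$".

```r
# Create a temporary directory with two small files so the sketch runs as-is.
# CSV is only a stand-in here; the directory name is hypothetical.
data_dir <- file.path(tempdir(), "named_list_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(data_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(x = 4:5), file.path(data_dir, "B.csv"), row.names = FALSE)

# Discover the files instead of hard-coding their names
file_paths <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file, then name each element after its file (without extension)
daten <- lapply(file_paths, read.csv)
names(daten) <- tools::file_path_sans_ext(basename(file_paths))

names(daten)
```

basename() drops the directory part of each path, so the element names are just the file stems.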

In the tidyverse you would use purrr. This is basically the same as the lapply() or sapply() approach, but in a different ecosystem.

library(arrow)
library(purrr)

file_names = c('a.parquet', 'B.parquet')

daten <- file_names %>% 
  set_names(tools::file_path_sans_ext) %>% 
  map(read_parquet)

You would access each list item through the usual ways.

daten$a
daten$B

# or

daten[["a"]]
daten[["B"]]

Explanation

The pipe operator %>% is extremely common in R code these days. It comes from the magrittr package, but is also re-exported by various tidyverse packages, including purrr.

The pipe takes the value on its left-hand side and passes it as the first argument to the function call on its right-hand side. So f(x, y) can be written as x %>% f(y). This is useful for chaining expressions together. R itself has a native pipe operator |> starting with version 4.1.0.
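As a small illustration of that equivalence, here is a sketch using the native pipe |> (so it assumes R 4.1.0 or later and needs no packages); for simple calls like these, %>% should behave the same way:

```r
# Nested calls: read from the inside out
nested <- round(sqrt(sum(c(1, 4, 9))), 1)

# The same computation with the native pipe: read from left to right
piped <- c(1, 4, 9) |> sum() |> sqrt() |> round(1)

identical(nested, piped)
```

Each step receives the previous result as its first argument, which is why round(1) here means round(<previous result>, 1).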

  • file_names is an unnamed character vector of the file names.
  • set_names() will make this a named vector by applying the function file_path_sans_ext() to file_names. This removes the file extension, so each element is named after its file name without the extension.
  • map() will iterate over each element of the vector, returning a list named according to the names of the vector elements. Each iteration runs the read_parquet function on the input (the file name).
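The three steps above can be sketched in base R as well. This is only an approximation of what the purrr pipeline does, and toupper() is a hypothetical stand-in for read_parquet() so that the example runs without any files:

```r
file_names <- c("a.parquet", "B.parquet")

# Steps 1 and 2: turn the unnamed vector into a named one
named_files <- setNames(file_names, tools::file_path_sans_ext(file_names))
names(named_files)

# Step 3: lapply() keeps the names of its input, just as map() does;
# toupper() stands in for the function that reads the file
result <- lapply(named_files, toupper)

names(result)
result[["a"]]
```

The key point is that the names travel with the vector, so the reading function never has to know about them.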

You can use named lists like so.

You can either use the names directly

sapply(file_names, arrow::read_parquet, USE.NAMES = TRUE, simplify = FALSE)

or set them afterwards, using whatever function you like to derive the names (here, the file name up to its last dot, extracted with stringr):

setNames(lapply(file_names, arrow::read_parquet), stringr::str_extract(file_names, "^.+(?=\\.)"))
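The two variants differ in the names they produce. A quick sketch of that difference, with nchar() as a hypothetical stand-in for the reader and tools::file_path_sans_ext() as one possible way to derive the names:

```r
file_names <- c("a.parquet", "B.parquet")

# Variant 1: sapply() with simplify = FALSE returns a list named by the
# input strings themselves, so the extension is part of each name
by_file <- sapply(file_names, nchar, USE.NAMES = TRUE, simplify = FALSE)
names(by_file)

# Variant 2: set the names afterwards, here without the extension
by_stem <- setNames(lapply(file_names, nchar),
                    tools::file_path_sans_ext(file_names))
names(by_stem)
```

With the first variant you would access an element as by_file[["a.parquet"]], with the second as by_stem[["a"]].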
