
R: Reading several Excel files into different dataframes and exploring these at the same time

I am writing some R code to convert 400 Excel files into machine-readable flat files. When the Excel files arrive, the turnaround time is tight, and there is no possibility of receiving the initial files in a machine-readable format.

I have the R code which pulls out the data in the rows and columns we need, deletes the spaces and presents it nicely in a machine-readable format. The issue I need to tackle now is confirming that each of the 400 files is in the correct format for the function to work properly. To do this, I just want to check lots of simple things, e.g. that the column 'title' is in cell A9 in each of the Excel files.

I am new to R and really struggling to write a function that will let me examine all 400 files in one go.

The closest I have got is this:

library(readxl)

template_dir <- "file path of main directory"
files <- list.files(path = template_dir, pattern = "\\.xlsx$", full.names = TRUE, recursive = TRUE)
df.files <- lapply(files, read_excel)

This generates a list with 400 elements, and I can view each one individually without a problem using

df.files[1]

But, if I try and use:

title_loc <- which(df.files[1] == "Title", arr.ind = TRUE)

It does not work; I just get an empty value. I know the which() function itself works, though: when I read a single Excel file into R as a data frame (or pass the file path in directly), which() works fine and returns [1, 9] as expected.
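I suspect the difference may be that single brackets on a list return a one-element list, while double brackets return the element itself, which is what which() needs to compare against. A minimal check of that assumption:

# df.files[1] is a list of length 1 that wraps the data frame;
# df.files[[1]] is the data frame itself
class(df.files[1])    # "list"
class(df.files[[1]])  # "tbl_df" "tbl" "data.frame" (read_excel returns a tibble)

# with double brackets the comparison gives a logical matrix,
# so which(..., arr.ind = TRUE) can return row/column indices
which(df.files[[1]] == "Title", arr.ind = TRUE)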

The 400 files are spread over several folders (nothing I can do about that either), and I can get a list of all the files using list.files. What I want to do is run a series of simple checks (the reference for 'title', the reference for 'age', the reference for 'location' and so on) to confirm that all 400 files are laid out in the same way. Ideally, the output for 'title' would go into one data frame, so I can check that the column is 1 for all 400 files and the row is 9 for all 400.

I think what I want is this:

title_loc <- which(*loop to cycle through every element in df.files* == "Title", arr.ind = TRUE)

But the way to write the loop is defeating me. Would it be easier to get the file paths for all 400 Excel files in a list and then just cycle through those (rather than using lapply to import all the data first)?
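The shape I am after is something like the sketch below, though I am not sure it is right. The helper check_title is just a name I have made up, and the expected position (row 9, column 1) comes from the checks above:

# illustrative helper: find "Title" in one data frame and return its
# position as a one-row data frame (NA if the label is missing)
check_title <- function(df) {
  loc <- which(df == "Title", arr.ind = TRUE)
  if (nrow(loc) == 0) return(data.frame(row = NA_integer_, col = NA_integer_))
  data.frame(row = loc[1, "row"], col = loc[1, "col"])
}

# run the check on every element of df.files and stack the results
# into a single data frame, one row per file
title_locs <- do.call(rbind, lapply(df.files, check_title))
title_locs$file <- files   # the files vector from list.files() above

# any file where 'Title' is not where it is expected
subset(title_locs, is.na(row) | row != 9 | col != 1)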

Thanks

I'm not sure what a machine-readable format is in your case, but if you want to loop through all the Excel files in a folder and load them all into R, the following code samples will do that for you.

# load names of Excel files
files <- list.files(path = "C:\\your_path_here\\", full.names = TRUE, pattern = "\\.xlsx$")

# create function to read every sheet of an Excel file into a named list
read_excel_allsheets <- function(filename, tibble = FALSE) {
  sheets <- readxl::excel_sheets(filename)
  x <- sapply(sheets, function(sheet) readxl::read_excel(filename, sheet = sheet),
              simplify = FALSE)
  if (!tibble) x <- lapply(x, as.data.frame)
  x
}

# execute function for all Excel files in "files"
all_data <- lapply(files, read_excel_allsheets)
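If you then want to run the layout checks you describe, something like the sketch below might work. I'm assuming the labels sit on the first sheet of each workbook, and the cell values ("Title", "Age", "Location") are guesses based on your question:

# illustrative: locate each label on the first sheet of one workbook,
# returning a small matrix of row/column positions
locate_labels <- function(sheets, labels = c("Title", "Age", "Location")) {
  sapply(labels, function(label) {
    loc <- which(sheets[[1]] == label, arr.ind = TRUE)
    if (nrow(loc) == 0) c(row = NA_integer_, col = NA_integer_)
    else c(row = loc[1, "row"], col = loc[1, "col"])
  })
}

# one position matrix per file; compare every file against the first
positions <- lapply(all_data, locate_labels)
all_same <- sapply(positions, identical, positions[[1]])
basename(files[!all_same])   # files whose layout deviates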

Or, using the XLConnect package:

library(XLConnect)

testDir <- "C:\\your_path_here\\"

# regex matching .xls or .xlsx at the end of the file name
re_file <- "\\.xlsx?$"
testFiles <- list.files(testDir, re_file, full.names = TRUE)

# This function rbinds the content of multiple sheets
# in the same workbook into a single data frame
# (assuming that all the sheets have the same column types)
rbindAllSheets <- function(file) {
  wb <- loadWorkbook(file)
  sheets <- getSheets(wb)
  do.call(rbind,
          lapply(sheets, function(sheet) {
            readWorksheet(wb, sheet)
          })
  )
}

# Get a single data frame from all the Excel files
result <- do.call(rbind, lapply(testFiles, rbindAllSheets))
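Note that this second approach rbinds every sheet of every workbook into one big data frame, which only makes sense when all the sheets really do share the same column types. For checking where labels sit in a fixed template, the first approach is the better fit, since it keeps one list element per file that you can loop over.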
