I understand that readxl
can be used to read in multiple worksheets from a workbook . However, I am struggling to extend this and vectorise this across many workbooks with different sheet names and number of sheets and data therein.
I demonstrate using the Enron spreadsheet data which is a bunch of .xlsx files that I downloaded.
head(list.files("../data/enron_spreadsheets/"), 3)
[1] "albert_meyers__1__1-25act.xlsx"
[2] "albert_meyers__2__1-29act.xlsx"
[3] "andrea_ring__10__ENRONGAS(1200).xlsx"
To make it manageable, we sample.
# Set the path to your directory of Enron spreadsheets here
enron_path <- "../data/enron_spreadsheets/"
# Set the sample size for testing here
sample_size <- 100
all_paths <- list.files(enron_path,
full.names = TRUE)
# For testing, look at n (sample_size) random workbooks.
set.seed(1337)
sample_paths <- sample(all_paths, sample_size)
paths <- sample_paths
Inspecting these workbooks and counting the number of worksheets therein reveals they have different number of sheets and contain different data.
# purr package
# https://jennybc.github.io/purrr-tutorial/index.html
sheet_count <- purrr::map(paths, readxl::excel_sheets) %>%
purrr::map(length) %>%
unlist()
hist(sheet_count, main = "")
However, to load all the sheets in a workbook into a list of data frames , we need to:
Use purrr::map()
to iterate sheet reading.
books <- dplyr::data_frame(filename = basename(paths), path = paths, sheet_name = purrr::map(paths, readxl::excel_sheets) ) %>% dplyr::mutate(id = as.character(row_number())) books # A tibble: 100 x 4 filename <chr> 1 kenneth_lay__19485__Mlp_1109.xlsx 2 kate_symes__18980__SP 15 pages.xls 3 chris_germany__1821__newpower-purc 4 john_griffith__15991__Forwards Det 5 jane_tholt__13278__bid2001A.xlsx 6 gerald_nemec__11481__EOLfieldnames 7 stacey_white__39009__Power RT Serv 8 eric_saibi__9766__012302.xlsx 9 david_delainey__8083__ENA Status o 10 daren_farmer__5035__HPLN0405.xlsx # ... with 90 more rows, and 3 # more variables: path <chr>, # sheet_name <list>, id <chr>
Here we have one row per workbook in books
with the workbook's worksheet names stored in a list column. We want one row per worksheet with the data contents of the worksheet stored in a list column so that we can add extra features based on the worksheets data (the worksheet is the experimental unit). The problem is it doesn't vectorise as expected, am I missing something?
This errors...
sheets <-
tibble::tibble("sheet_name" = unlist(books$sheet_name),
"path" = rep(paths,
times = unlist(
purrr::map_int(books$sheet_name, length))
),
"filename" = basename(path),
"sheet_data" = tibble::lst(
readxl::read_excel(path = path[],
sheet = sheet_name[])
)
) %>%
dplyr::mutate(id = as.character(row_number()))
Error in switch(ext, xls = "xls", xlsx = "xlsx", xlsm = "xlsx", if (nzchar(ext)) { :
EXPR must be a length 1 vector
The code works when not passed a vector for workbook path and sheet name, but obviously the data is not from the correct worksheet in this example below:
sheets <-
tibble::tibble("sheet_name" = unlist(books$sheet_name),
"path" = rep(paths,
times = unlist(
purrr::map_int(books$sheet_name, length))
),
"filename" = basename(path),
"sheet_data" = tibble::lst(
readxl::read_excel(path = path[1],
sheet = sheet_name[1])
)
) %>%
dplyr::mutate(id = as.character(row_number()))
dplyr::glimpse(sheets)
Observations: 313
Variables: 5
$ sheet_name <chr> "MLP's", "DJ SP15", "newpower-p...
$ path <chr> "../data/enron_spreadsheets//ke...
$ filename <chr> "kenneth_lay__19485__Mlp_1109.x...
$ sheet_data <list> [<# A tibble: 57 x 46, ...
$ id <chr> "1", "2", "3", "4", "5", "6", "...
How do I read in the data from many worksheets in many workbooks into a list column in a tibble?
I'm new to reading in messy spreadsheets and using purrr
any help or pointers would be appreciated.
Since you mention the purrr
package, some other tidyverse packages are worth considering.
dplyr
for mutate()
, when applying purrr::map()
to a column of a data frame and storing the result as list-column. tidyr
for unnest()
, which expands a list-column so that each row inside a list-column becomes a row in the overall data frame. tibble
for nicely printed nested data frames Sample files are needed to demonstrate. This code uses the openxlsx
package to create one file containing two sheets (the built-in iris
and mtcars
datasets), and another file containing three sheets (adding the built-in attitude
dataset).
library(openxlsx)
# Create two spreadsheet files, with different numbers of worksheets
write.xlsx(list(iris, mtcars, attitude), "three_sheets.xlsx")
write.xlsx(list(iris, mtcars), "two_sheets.xlsx")
Now a solution.
First, list the filenames, which will passed to readxl::excel_sheets()
for the names of the sheets within each file, and readxl::read_excel()
to import the data itself.
(paths <- list.files(pattern = "*.xlsx"))
#> [1] "three_sheets.xlsx" "two_sheets.xlsx"
(x <- tibble::data_frame(path = paths))
#> # A tibble: 2 x 1
#> path
#> <chr>
#> 1 three_sheets.xlsx
#> 2 two_sheets.xlsx
'Map' the readxl::excel_sheets()
function over each of the file paths, and store the results in a new list-column. Each row of the sheet_name
column is a vector of sheet names. As expected, the first one has three sheet names, while the second has two.
(x <- dplyr::mutate(x, sheet_name = purrr::map(path, readxl::excel_sheets)))
#> # A tibble: 2 x 2
#> path sheet_name
#> <chr> <list>
#> 1 three_sheets.xlsx <chr [3]>
#> 2 two_sheets.xlsx <chr [2]>
We need to pass each filename and each sheet name into readxl::read_excel(path=, sheet=)
, so the next step is to have a data frame where each row gives a path and one sheet name. This is done using tidyr::unnest()
.
(x <- tidyr::unnest(x))
#> # A tibble: 5 x 2
#> path sheet_name
#> <chr> <chr>
#> 1 three_sheets.xlsx Sheet 1
#> 2 three_sheets.xlsx Sheet 2
#> 3 three_sheets.xlsx Sheet 3
#> 4 two_sheets.xlsx Sheet 1
#> 5 two_sheets.xlsx Sheet 2
Now each path and sheet name can be passed into readxl::read_excel()
, using purrr::map2()
rather than purrr::map()
because we pass two arguments rather than one.
(x <- dplyr::mutate(x, data = purrr::map2(path, sheet_name,
~ readxl::read_excel(.x, .y))))
#> # A tibble: 5 x 3
#> path sheet_name data
#> <chr> <chr> <list>
#> 1 three_sheets.xlsx Sheet 1 <tibble [150 × 5]>
#> 2 three_sheets.xlsx Sheet 2 <tibble [32 × 11]>
#> 3 three_sheets.xlsx Sheet 3 <tibble [30 × 7]>
#> 4 two_sheets.xlsx Sheet 1 <tibble [150 × 5]>
#> 5 two_sheets.xlsx Sheet 2 <tibble [32 × 11]>
Now each dataset is in a separate row of the data
column. We can look at just one of the datasets by subsetting that column.
x$data[3]
#> [[1]]
#> # A tibble: 30 x 7
#> rating complaints privileges learning raises critical advance
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 43.0 51.0 30.0 39.0 61.0 92.0 45.0
#> 2 63.0 64.0 51.0 54.0 63.0 73.0 47.0
#> 3 71.0 70.0 68.0 69.0 76.0 86.0 48.0
#> 4 61.0 63.0 45.0 47.0 54.0 84.0 35.0
#> 5 81.0 78.0 56.0 66.0 71.0 83.0 47.0
#> 6 43.0 55.0 49.0 44.0 54.0 49.0 34.0
#> 7 58.0 67.0 42.0 56.0 66.0 68.0 35.0
#> 8 71.0 75.0 50.0 55.0 70.0 66.0 41.0
#> 9 72.0 82.0 72.0 67.0 71.0 83.0 31.0
#> 10 67.0 61.0 45.0 47.0 62.0 80.0 41.0
#> # ... with 20 more rows
I just tested this an it worked fine for one workbook.
library(readxl)
read_excel_allsheets <- function(filename) {
sheets <- readxl::excel_sheets(filename)
x <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
names(x) <- sheets
x
}
This could be called with:
mysheets <- read_excel_allsheets("foo.xls")
Note, it is for xls and xlsx; it will not work for xlsb files.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.