Reading in multiple Excel files, adding a column, then binding
I have a series of Excel files that I want to read into R, add a date column to based on the file name, and then bind together.
The files are named User_Info_Jan, User_Info_Feb, User_Info_Mar. The month appears only in the file name, not in the file itself. An example of what the User_Info_Jan file looks like:
ID Name
ABC Joe Smith
DEF Henry Cooper
ZCS Kelly Ma
Is there a way to read the files in using the pattern in the file name (pattern = User_Info_), then add a column called "Month" indicating which month each file is for, before binding them together?
Sample data frame after adding the Month column:
ID Name Month
ABC Joe Smith January
DEF Henry Cooper January
ZCS Kelly Ma January
Sample data frame after binding together:
ID Name Usage Month
ABC Joe Smith January
DEF Henry Cooper January
ZCS Kelly Ma January
KFY Lisa Schwartz February
LFG Alex Shah March
I would use the map() function from the purrr library to solve this problem.
Your files aren't available to make this reproducible, so here is an example adapted from my recent code:
library(tidyverse)  # provides %>%, tibble, str_match, read_csv, map and unnest

# Get all the filenames (I assume these contain the month in your case)
GravfilesMap <- list.files("GravityModel/MapOut", full.names = TRUE)
GravMap <-
  # Use a regex to capture the part of the filename you need; for the month I would try
  # "_([a-zA-Z]+)\\.xlsx$" passed to str_match (column 2 holds the captured group)
  tibble(file_id = str_match(GravfilesMap, "(\\d+)\\.csv$")[, 2]) %>%
  # For each file, read the data into its own data frame
  # (each row then holds the captured value and a nested data frame)
  # I have used read_csv here; you would use something like readxl::read_excel
  # The files are read in the same order as the captured values, as both come from GravfilesMap
  mutate(file_contents = map(GravfilesMap, ~ read_csv(.x, col_names = FALSE)))
# Unnest the data frames to get the form that was requested
GravMap <- GravMap %>% unnest(cols = file_contents)
Detail on a similar method can be found at https://clauswilke.com/blog/2016/06/13/reading-and-combining-many-tidy-data-files-in-r/
I'll demonstrate with fake filenames, but the real commands I suggest you run are commented out with the same structure. I'm assuming .xlsx for "excel files", but this works equally well with .csv (just update the pattern).
# files <- list.files(path = ".", pattern = "User_Info_.*\\.xlsx$", full.names = TRUE)
files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")
monthnames <- strcapture("User_Info_(.*)\\.xlsx", files, list(month = ""))
monthnames
# month
# 1 Jan
# 2 Feb
# 3 Mar
At this point, we've extracted the month name from each filename. I find strcapture (in base R) better than gsub, as the latter returns the entire string if there are no matches; another alternative in base R is regmatches(files, gregexpr(...)), but that seems a bit more complicated than it needs to be here. Another alternative is stringr::str_extract, which might be more intuitive if you're already using stringr and/or other tidyverse packages.
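To make the no-match difference concrete, here is a minimal sketch in base R. The file list is hypothetical, and the third entry deliberately breaks the naming convention; sub() stands in for gsub(), since the two behave the same way when nothing matches.

```r
# Hypothetical file list; the last entry does not follow the User_Info_ convention.
files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./notes.txt")

# strcapture() yields NA where the pattern does not match.
captured <- strcapture("User_Info_(.*)\\.xlsx$", files, list(month = ""))

# sub() with a back-reference returns the input unchanged where the pattern does not match.
substituted <- sub(".*User_Info_(.*)\\.xlsx$", "\\1", files)

print(captured$month)
print(substituted)

# Non-matching files are therefore easy to drop after strcapture().
valid_files <- files[!is.na(captured$month)]
print(valid_files)
```

With strcapture the stray file shows up as NA and can be filtered out, whereas the sub() result leaves the full path in place where a month name was expected.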
From here, we can iterate over the files to read them in.
# out <- Map(function(mn, fn) transform(readxl::read_excel(fn), month = mn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) transform(mtcars[sample(32,size=2),], month = mn), monthnames$month, files)
out
# $Jan
# mpg cyl disp hp drat wt qsec vs am gear carb month
# Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 Jan
# Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Jan
# $Feb
# mpg cyl disp hp drat wt qsec vs am gear carb month
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Feb
# Pontiac Firebird 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2 Feb
# $Mar
# mpg cyl disp hp drat wt qsec vs am gear carb month
# Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Mar
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Mar
Combining that list of frames into a single frame is straightforward:
do.call(rbind, out)
# mpg cyl disp hp drat wt qsec vs am gear carb month
# Jan.Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 Jan
# Jan.Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Jan
# Feb.Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Feb
# Feb.Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Feb
# Mar.Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Mar
# Mar.Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Mar
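Putting the steps so far together, the following is a rough end-to-end sketch in base R rather than a definitive solution. It makes several assumptions: it writes small CSV stand-ins to a temporary directory so that it runs without extra packages, the helper name read_one is hypothetical, and the abbreviations are expanded to full month names (as in the question's desired output) with the built-in month.abb and month.name constants.

```r
# Create three small stand-in files in a temporary directory
# (CSV instead of Excel, so the sketch needs no extra packages).
tmp_dir <- file.path(tempdir(), "user_info_demo")
dir.create(tmp_dir, showWarnings = FALSE)
demo <- list(
  Jan = data.frame(ID = c("ABC", "DEF", "ZCS"),
                   Name = c("Joe Smith", "Henry Cooper", "Kelly Ma")),
  Feb = data.frame(ID = "KFY", Name = "Lisa Schwartz"),
  Mar = data.frame(ID = "LFG", Name = "Alex Shah")
)
for (mn in names(demo)) {
  write.csv(demo[[mn]], file.path(tmp_dir, paste0("User_Info_", mn, ".csv")),
            row.names = FALSE)
}

files <- list.files(tmp_dir, pattern = "^User_Info_.*\\.csv$", full.names = TRUE)
abbr <- strcapture("User_Info_(.*)\\.csv$", files, list(month = ""))$month

# list.files() sorts alphabetically (Feb, Jan, Mar), so reorder by calendar month.
ord <- order(match(abbr, month.abb))
files <- files[ord]
abbr <- abbr[ord]

# read_one is a hypothetical helper; with .xlsx files, swap read.csv
# for readxl::read_excel and adjust the patterns above.
read_one <- function(mn, fn) {
  df <- read.csv(fn, stringsAsFactors = FALSE)
  df$Month <- month.name[match(mn, month.abb)]  # "Jan" becomes "January"
  df
}

combined <- do.call(rbind, Map(read_one, abbr, files))
rownames(combined) <- NULL  # drop the "Jan.1"-style row names added by rbind
print(combined)
```

The reordering step matters only if the row order of the result should follow the calendar; the Month column is correct either way, because each file is paired with the abbreviation captured from its own name.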
An alternative to all of that is to use data.table::rbindlist or dplyr::bind_rows, and assign the "id" column directly:
# out <- Map(function(mn, fn) readxl::read_excel(fn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) mtcars[sample(32,size=2),], monthnames$month, files)
data.table::rbindlist(out, idcol = "month")
# month mpg cyl disp hp drat wt qsec vs am gear carb
# <char> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1: Jan 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 2: Jan 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# 3: Feb 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# 4: Feb 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# 5: Mar 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# 6: Mar 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
dplyr::bind_rows(out, .id = "month")
# month mpg cyl disp hp drat wt qsec vs am gear carb
# Chrysler Imperial Jan 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# Hornet Sportabout Jan 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# Mazda RX4 Feb 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# Pontiac Firebird Feb 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# Merc 280 Mar 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# Hornet 4 Drive Mar 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
The latter two work because when I called Map earlier, the first argument (monthnames$month) passed to the inner function was used as the names of the output list, which is why you see $Jan etc. as the elements of the returned list. Both rbindlist and bind_rows use those names as "id" columns when idcol= / .id= are used. (If no names are actually present, both functions number the elements instead.)
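The naming behaviour of Map can be checked in isolation. The sketch below uses only base R and placeholder inner functions that return the file path instead of reading anything; the variable names by_month, by_file and by_index are made up for illustration.

```r
months <- c("Jan", "Feb", "Mar")
files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")

# First argument is an unnamed character vector: its values become the list names.
by_month <- Map(function(mn, fn) fn, months, files)
print(names(by_month))

# Swapping the argument order names the list after the file paths instead.
by_file <- Map(function(fn, mn) fn, files, months)
print(names(by_file))

# A non-character first argument without names leaves the result unnamed.
by_index <- Map(function(i, fn) fn, seq_along(files), files)
print(names(by_index))
```

This is why the month vector is passed first: if the file paths came first, the "id" column produced by idcol= or .id= would hold paths rather than month names.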
You can try it with the purrr package like this:
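files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")
months <- c("Jan", "Feb", "Mar")
library(openxlsx)
library(purrr)
library(dplyr)  # needed for mutate()
# Read each file and tag its rows with the matching month, then row-bind the results
map2_dfr(files, months, function(x, y) read.xlsx(x) %>% mutate(Month = y))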