Reading in multiple Excel files, adding a column, then binding

I have a series of Excel files that I want to read into R, add a date column to based on the file name, and then bind together.

The files are named User_Info_Jan, User_Info_Feb, User_Info_Mar. The month appears only in the file name, not in the contents of the file itself. An example of what the User_Info_Jan file looks like:

ID   Name
ABC  Joe Smith
DEF  Henry Cooper 
ZCS  Kelly Ma

Is there a way I can read the files in using the pattern in the file name (pattern = User_Info_), then add a column called "Month" indicating what month the file is for, before binding together?

Sample data frame after adding the month column:

ID   Name           Month
ABC  Joe Smith      January
DEF  Henry Cooper   January
ZCS  Kelly Ma       January

Sample data frame after binding together:

ID   Name           Month
ABC  Joe Smith      January
DEF  Henry Cooper   January
ZCS  Kelly Ma       January
KFY  Lisa Schwartz  February
LFG  Alex Shah      March

I would use the map() function from the purrr package to solve this problem.

A reproducible example is hard to give because we are reading in files, but an example from my recent code is as follows:

library(tidyverse)  # provides %>%, tibble(), str_match(), map(), read_csv(), unnest()

# Get all the file names (I assume these contain the month in your case)
GravfilesMap <- list.files("GravityModel/MapOut", full.names = TRUE)

GravMap <-
  # Capture the identifying part of each file name with a regex and store it as a column.
  # For your files I would try "_([a-zA-Z]+)\\.xlsx$" to capture the month name.
  tibble(id = str_match(GravfilesMap, "(\\d+)\\.csv$")[, 2]) %>%
  # Read each file into its own nested data frame, giving one row per file.
  # I have used read_csv here; you will use something like readxl::read_excel.
  # The files are read in list order, so each row lines up with its captured name.
  mutate(file_contents = map(GravfilesMap, ~ read_csv(.x, col_names = FALSE)))

# Unnest the data frames to get the single long table that was requested
GravMap <- GravMap %>% unnest(cols = file_contents)

Detail on a similar method can be found at https://clauswilke.com/blog/2016/06/13/reading-and-combining-many-tidy-data-files-in-r/

I'll demonstrate with fake filenames; the real commands I suggest you run have the same structure and are shown commented out. I'm assuming .xlsx for "excel files", but this works equally well with .csv (just update the pattern and the reading function).

# files <- list.files(path = ".", pattern = "User_Info_.*\\.xlsx$", full.names = TRUE)
files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")
monthnames <- strcapture("User_Info_(.*)\\.xlsx", files, list(month = ""))
monthnames
#   month
# 1   Jan
# 2   Feb
# 3   Mar

At this point, we've extracted the month name from each filename. I find strcapture (in base R) better than gsub, as the latter returns the entire string if there are no matches; another alternative in base R is regmatches(files, gregexpr(...)), but that seems a bit more complicated than it needs to be here. Another alternative is stringr::str_extract, which might be more intuitive if you're already using stringr and/or other tidyverse packages.
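To make the difference on non-matching input concrete, here is a small base R sketch. The second file name ("./notes.txt") is a made-up non-matching input, and the regmatches() line uses regexec() rather than gregexpr(), since regexec() is the variant that exposes capture groups.

```r
# Hypothetical input: one matching file name and one that does not match
x <- c("./User_Info_Jan.xlsx", "./notes.txt")
pattern <- "User_Info_(.*)\\.xlsx"

# strcapture() returns a data frame and fills in NA where the pattern does not match
captured <- strcapture(pattern, x, list(month = ""))

# sub() returns the input string unchanged when nothing matches
substituted <- sub(paste0(".*", pattern, "$"), "\\1", x)

# regmatches() + regexec() returns a list; non-matches are zero-length vectors
matched <- regmatches(x, regexec(pattern, x))

captured$month    # "Jan" NA
substituted       # "Jan" "./notes.txt"
lengths(matched)  # 2 0
```

The NA from strcapture() is easy to detect with is.na(), whereas the unchanged string from sub() can slip through as if it were a month.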

From here, we can iterate over the files to read them in.

# out <- Map(function(mn, fn) transform(readxl::read_excel(fn), month = mn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) transform(mtcars[sample(32,size=2),], month = mn), monthnames$month, files)
out
# $Jan
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb month
# Chrysler Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4   Jan
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   Jan
# $Feb
#                   mpg cyl disp  hp drat    wt  qsec vs am gear carb month
# Mazda RX4        21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   Feb
# Pontiac Firebird 19.2   8  400 175 3.08 3.845 17.05  0  0    3    2   Feb
# $Mar
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb month
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   Mar
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   Mar

Combining that list of frames into a single frame is direct:

do.call(rbind, out)
#                        mpg cyl  disp  hp drat    wt  qsec vs am gear carb month
# Jan.Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4   Jan
# Jan.Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   Jan
# Feb.Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   Feb
# Feb.Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2   Feb
# Mar.Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   Mar
# Mar.Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   Mar
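To try the same steps end to end on real files without needing Excel, here is a sketch that first writes three small CSV files into a temporary directory. The directory name and file contents are stand-ins invented for illustration, and read.csv() takes the place of readxl::read_excel().

```r
# Stand-in data: three tiny CSV files in a temporary directory (hypothetical contents)
dir <- file.path(tempdir(), "user_info_demo")
dir.create(dir, showWarnings = FALSE)
demo <- list(
  Jan = data.frame(ID = c("ABC", "DEF"), Name = c("Joe Smith", "Henry Cooper")),
  Feb = data.frame(ID = "KFY", Name = "Lisa Schwartz"),
  Mar = data.frame(ID = "LFG", Name = "Alex Shah")
)
for (mn in names(demo)) {
  write.csv(demo[[mn]], file.path(dir, paste0("User_Info_", mn, ".csv")), row.names = FALSE)
}

# Same steps as above, with .csv in the pattern and read.csv() as the reader
files <- list.files(path = dir, pattern = "^User_Info_.*\\.csv$", full.names = TRUE)
monthnames <- strcapture("User_Info_(.*)\\.csv$", files, list(month = ""))
out <- Map(function(mn, fn) transform(read.csv(fn), month = mn), monthnames$month, files)
combined <- do.call(rbind, out)
rownames(combined) <- NULL  # drop the "Jan.1"-style row names that rbind() creates
combined
```

Note that list.files() returns names in alphabetical order, so the rows come out as Feb, Jan, Mar rather than in calendar order; the month column is still correct for each row.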

An alternative to all of that is to use data.table::rbindlist or dplyr::bind_rows, and assign the "id" column directly:

# out <- Map(function(mn, fn) readxl::read_excel(fn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) mtcars[sample(32,size=2),], monthnames$month, files)

data.table::rbindlist(out, idcol = "month")
#     month   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <char> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:    Jan  14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4
# 2:    Jan  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
# 3:    Feb  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
# 4:    Feb  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
# 5:    Mar  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# 6:    Mar  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1

dplyr::bind_rows(out, .id = "month")
#                   month  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Chrysler Imperial   Jan 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
# Hornet Sportabout   Jan 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Mazda RX4           Feb 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Pontiac Firebird    Feb 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
# Merc 280            Mar 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Hornet 4 Drive      Mar 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1

The latter two work because when I called Map earlier, the first argument (monthnames$month) passed to the inner function is used as the names for the list output, which is why you see $Jan etc. as the elements of the returned list. Both rbindlist and bind_rows use those names as "id" columns when idcol= / .id= are used. (If no "names" are actually present, both functions count along them.)
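The question's expected output shows full month names ("January"), while the file names carry three-letter abbreviations. One possible way to convert, assuming the abbreviations are the standard English ones, is to look them up in base R's built-in month.abb and month.name constants; the variable names below are made up for illustration.

```r
# Abbreviations as extracted from the file names
abbr <- c("Jan", "Feb", "Mar")

# month.abb and month.name are built-in, English-only constants in base R
full <- month.name[match(abbr, month.abb)]
full
# "January" "February" "March"

# Optional: an ordered factor so the months sort chronologically, not alphabetically
full_factor <- factor(full, levels = month.name, ordered = TRUE)
sort(full_factor)
```

An abbreviation that is not in month.abb yields NA, which makes unexpected file names easy to spot.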

You can try the purrr package like this:

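files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")
months <- c("Jan", "Feb", "Mar")

library(openxlsx)
library(purrr)
library(dplyr)  # needed for mutate()

# Read each file, add its month as a column, and row-bind the results
map2_dfr(files, months, function(x, y) read.xlsx(x) %>% mutate(Month = y))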
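Typing the months by hand only works if the vector stays in the same order as the files. As a hedged alternative, the months could be derived from the file names themselves; this sketch assumes every file follows the User_Info_<month> convention from the question.

```r
files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")

# Strip the directory and the extension, then the fixed "User_Info_" prefix
stems <- tools::file_path_sans_ext(basename(files))
months <- sub("^User_Info_", "", stems)
months
# "Jan" "Feb" "Mar"
```

Because months is computed from files, the two vectors always line up, whatever order list.files() returns.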

