
Loop over several Excel files to create different dataframes, perform group by and save as a single df in R

I am new to R and have a question that I hope you can help with.

I have several Excel files in one folder. They belong to different branches but have the same structure.
I would like to loop over them, load each one into R as a dataframe, perform a group by, save everything in a single dataframe, and export it as a single file. Would this be possible?

After looking at several answers here, I did this:

# Load the data as different dataframes

library(tidyverse)
library(readxl)

f <- list.files(pattern="xlsx")

myfiles = lapply(f, read_excel)


for (i in 1:length(f)) assign(f[i], read_excel(f[i], sheet = "Deutsch", skip=7), data.frame(f[i]))

This saves each file as a separate dataframe. I don't know how to access them all together, so I manually created a list:

list_df = list(filialAA.xlsx, filialAB.xlsx,filianAC.xlsx,filianAD.xlsx,filianAE.xlsx...etc)

Then I wrote a group by to perform some calculations:

for (i in 1:length(list_df))
{
  list_df[i] %>% 
    group_by(ABC) %>% 
    summarise(`Revenue in EUR` = sum(`Revenue in EUR`),
              `Weight in KG` = sum(`Weight in KG`),
              `Number of Materials` = length(`Materials`),
              `Avg of deliveries` = mean(`Deliveries`))
}

If I run this on each dataframe separately, it works, but inside the loop it does not. Could you help me loop over all the dataframes, perform this group by, and gather the results in one single file? Is it possible?

Thanks a lot for your attention!

EDIT: Here is a dummy data sample:

> dput(df1)

structure(list(Materials = c("11575358", "75378378", "21333333", 
"02469984", "05465478", "05645648"), Deliveries = c(8, 1, 12, 
5, 1, 1), ABC = c("C", "A", "C", "B", "C", "C"), `Revenue in EUR` = c(6179, 
1804802.46, 3768.04, 9e+05, 1597.5, 1544.55), `Weight in KG` = c(16.6, 
4.695625, 19, 9.14625, 2.74041666666667, 1.44208333333333)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

> dput(df2)

structure(list(Materials = c("48654798", "05465489", "04598496", 
"08789453", "01589494", "06459849", "54694985", "65498848"), 
    Deliveries = c(24, 6, 32, 3, 11, 30, 45, 2), ABC = c("C", 
    "B", "C", "B", "C", "A", "A", "C"), `Revenue in EUR` = c(5509, 
    506978, 3978.04, 7e+05, 1597.5, 1200258, 2406975, 4059), 
    `Weight in KG` = c(29.6, 19, 24, 9.14625, 2.74041666666667, 
    50, 60, 10)), row.names = c(NA, -8L), class = c("tbl_df", 
"tbl", "data.frame"))

The original Excel files are in xlsx format, with 5000 to 15000 rows, about 20 columns, and 7 sheets each. There are 22 Excel files to loop over.

This may contain some errors because I don't have your files, but try something like this:

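# First, write the sample data to xlsx files. I use the xlsx package because I prefer it,
# but you should already have your files.
# Note: df1 is written to both files here, which is why the two results below are identical.
xlsx::write.xlsx2(df1,"df1.xlsx")
xlsx::write.xlsx2(df1,"df2.xlsx")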

library(tidyverse)
library(readxl)

# list all the xlsx files; "\\.xlsx$" matches only names ending in .xlsx
f <- list.files(pattern="\\.xlsx$")
f
[1] "df1.xlsx" "df2.xlsx"

# an empty list
listed <- list()
# loop that populate the empty list with your files
for (i in f) { 
  listed[[i]] <- read_excel(i, sheet = "Sheet1" # , skip = 7  
                            )
  print(paste0("read the ", i, " file")) # progress message: says what it's doing
}

 listed
$df1.xlsx
# A tibble: 6 x 6
  ...1  Materials Deliveries ABC   `Revenue in EUR` `Weight in KG`
  <chr> <chr>          <dbl> <chr>            <dbl>          <dbl>
1 1     11575358           8 C                6179           16.6 
2 2     75378378           1 A             1804802.           4.70
3 3     21333333          12 C                3768.          19   
4 4     02469984           5 B              900000            9.15
5 5     05465478           1 C                1598.           2.74
6 6     05645648           1 C                1545.           1.44

$df2.xlsx
# A tibble: 6 x 6
  ...1  Materials Deliveries ABC   `Revenue in EUR` `Weight in KG`
  <chr> <chr>          <dbl> <chr>            <dbl>          <dbl>
1 1     11575358           8 C                6179           16.6 
2 2     75378378           1 A             1804802.           4.70
3 3     21333333          12 C                3768.          19   
4 4     02469984           5 B              900000            9.15
5 5     05465478           1 C                1598.           2.74
6 6     05645648           1 C                1545.           1.44

# now lapply to each element of the list, the summary, creating a new list
list_result <- lapply(listed, function(x) x %>% 
                                          group_by(ABC) %>% 
                                          summarise(
                          `Revenue in EUR` = sum(`Revenue in EUR`),
                          `Weight in KG` = sum(`Weight in KG`),
                          `Number of Materials` = length(`Materials`),
                          `Avg of deliveries` = mean(`Deliveries`)))

# put the result in a data.frame  
do.call(rbind,list_result)
# A tibble: 6 x 5
  ABC   `Revenue in EUR` `Weight in KG` `Number of Materials` `Avg of deliveries`
* <chr>            <dbl>          <dbl>                 <int>               <dbl>
1 A             1804802.           4.70                     1                 1  
2 B              900000            9.15                     1                 5  
3 C               13089.          39.8                      4                 5.5
4 A             1804802.           4.70                     1                 1  
5 B              900000            9.15                     1                 5  
6 C               13089.          39.8                      4                 5.5
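
For reference, the loop in the question fails for two reasons that the lapply version avoids: list_df[i] returns a one-element list rather than the data frame itself (group_by() has no method for a plain list), and a for loop neither prints nor stores the value of the expression in its body. The sketch below shows a corrected loop under those assumptions. It uses only the Deliveries and ABC columns of the question's df1 and df2 samples, and the names list_df and results are illustrative.

```r
library(dplyr)

# Deliveries and ABC columns taken from the df1/df2 samples in the question
df1 <- data.frame(Deliveries = c(8, 1, 12, 5, 1, 1),
                  ABC = c("C", "A", "C", "B", "C", "C"))
df2 <- data.frame(Deliveries = c(24, 6, 32, 3, 11, 30, 45, 2),
                  ABC = c("C", "B", "C", "B", "C", "A", "A", "C"))

list_df <- list(df1 = df1, df2 = df2)

class(list_df[1])    # "list": single brackets return a one-element list
class(list_df[[1]])  # double brackets return the data frame itself

# Same idea as the loop in the question, but with [[i]] and with the result stored
results <- vector("list", length(list_df))
for (i in seq_along(list_df)) {
  results[[i]] <- list_df[[i]] %>%
    group_by(ABC) %>%
    summarise(`Avg of deliveries` = mean(Deliveries))
}
```

After the loop, results holds one summary per data frame and can be combined with do.call(rbind, results) as above.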

You can also use purrr::map_dfr:

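# use .x rather than a bare . on the left of the first pipe:
# `. %>% f()` defines a function instead of calling f() on the data
map_dfr(list_df, ~(.x %>% 
    group_by(ABC) %>% 
    summarise(`Revenue in EUR` = sum(`Revenue in EUR`),
              `Weight in KG` = sum(`Weight in KG`),
              `Number of Materials` = length(`Materials`),
              `Avg of deliveries` = mean(`Deliveries`))))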

map_dfr row-binds the per-file results into a single data frame in the same step.

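Even if you have already read the files into myfiles, you can use the following syntax. It drops the first five rows, promotes the next row to column names with janitor::row_to_names(), and converts the value columns with as.numeric():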


library(janitor)
map_dfr(myfiles, ~(.[-c(1:5),] %>% row_to_names(1) %>% 
                     group_by(ABC) %>% 
                     summarise(`Revenue in EUR` = sum(as.numeric(`Revenue in EUR`)),
                               `Weight in KG` = sum(as.numeric(`Weight in KG`)),
                               `Number of Materials` = length(`Materials`),
                               `Avg of deliveries` = mean(as.numeric(`Deliveries`)))
                   %>% ungroup()))

Results with your sample files:

# A tibble: 6 x 5
  ABC   `Revenue in EUR` `Weight in KG` `Number of Materials` `Avg of deliveries`
  <chr>            <dbl>          <dbl>                 <int>               <dbl>
1 A             1804802.           4.70                     1                 1  
2 B              900000            9.15                     1                 5  
3 C               13089.          39.8                      4                 5.5
4 A             3607233          110                        2                37.5
5 B             1206978           28.1                      2                 4.5
6 C               15144.          66.3                      4                17.2
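
The rows above no longer say which file they came from. One possible way to keep that information, sketched here under a few assumptions, is to name the list elements and pass .id to map_dfr, then export the combined table once. The data are the df1 and df2 samples from the question; the list names filialAA and filialAB, the helper summarise_abc, and the output file name are hypothetical.

```r
library(dplyr)
library(purrr)

# Sample data from the question
df1 <- tibble(
  Materials = c("11575358", "75378378", "21333333", "02469984", "05465478", "05645648"),
  Deliveries = c(8, 1, 12, 5, 1, 1),
  ABC = c("C", "A", "C", "B", "C", "C"),
  `Revenue in EUR` = c(6179, 1804802.46, 3768.04, 9e+05, 1597.5, 1544.55),
  `Weight in KG` = c(16.6, 4.695625, 19, 9.14625, 2.74041666666667, 1.44208333333333)
)
df2 <- tibble(
  Materials = c("48654798", "05465489", "04598496", "08789453",
                "01589494", "06459849", "54694985", "65498848"),
  Deliveries = c(24, 6, 32, 3, 11, 30, 45, 2),
  ABC = c("C", "B", "C", "B", "C", "A", "A", "C"),
  `Revenue in EUR` = c(5509, 506978, 3978.04, 7e+05, 1597.5, 1200258, 2406975, 4059),
  `Weight in KG` = c(29.6, 19, 24, 9.14625, 2.74041666666667, 50, 60, 10)
)

# Hypothetical helper wrapping the summary from the question
summarise_abc <- function(df) {
  df %>%
    group_by(ABC) %>%
    summarise(`Revenue in EUR` = sum(`Revenue in EUR`),
              `Weight in KG` = sum(`Weight in KG`),
              `Number of Materials` = n(),
              `Avg of deliveries` = mean(Deliveries),
              .groups = "drop")
}

# The list names become the values of the "file" column
list_df <- list(filialAA = df1, filialAB = df2)
result <- map_dfr(list_df, summarise_abc, .id = "file")

# Export everything as a single file (tempdir() is used only to keep the sketch tidy)
out_file <- file.path(tempdir(), "summary_all_files.csv")
write.csv(result, out_file, row.names = FALSE)
```

With real files, the list could be named with names(myfiles) <- f before mapping, so that each row carries its source file name.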

I like to write functions, so I would do it like this (it is longer, but it gives you a more stable setup to modify and debug when required).

library(tidyverse)
library(readxl)

# Main Function
main_function <- function(import, name){
 main_function.create_path() -> path
 main_function.create_output() -> output
 for(file in list.files(path)){
  # skip anything that is not an .xlsx workbook
  if(!str_detect(file, '\\.xlsx$')){
   next
  }
  read_excel(file.path(path, file), sheet = "Deutsch", skip = 7) -> data
  main_function.calculate_values(data) -> data.values
  main_function.append_values(file, data.values, output) -> output
 }
 main_function.export(path, output, name)
 if(import){
  # also make the result available in the global environment
  assign('values', output, envir = .GlobalEnv)
 }
}
    
# Functions
main_function.export <- function(path, output, name){
 write.csv(output, file = file.path(path, paste0(name, '.csv')), row.names = FALSE)
 }
 
main_function.append_values <- function(file, data.values, output){
 # This adds the summary rows for one file (one row per ABC class) to output,
 # with the name of the file, without the .xlsx at the end, in the first
 # column and the calculated data in the other columns
 str_remove(file, '\\.xlsx$') -> file.name
 data.values %>% mutate(file = file.name, .before = 1) -> data.values
 return(bind_rows(output, data.values))
 }

main_function.calculate_values <- function(data){
 data %>% group_by(ABC) %>%
  summarize(`Revenue in EUR` = sum(`Revenue in EUR`, na.rm=TRUE)
            # ...., add the remaining summaries here, each preceded by a comma
            ) -> data
 return(data)
 }
 
main_function.create_path <- function(){
 '<path to files>' -> path
 return(path) 
 }
   
main_function.create_output <- function(){
 # start empty; bind_rows() adds the columns when the first file is appended
 tibble() -> output
 return(output)
 }

This creates main_function, which, when called, loops over all the files in the given path, reads each one, processes it, and appends the result to output, which is then saved in the same path under the name you give it. If you set import to TRUE, it also saves the output as values in the global environment.
