Loop over several Excel files to create different data frames, perform a group by, and save as a single data frame in R
I am new to R and have a question.
I have several Excel files in one folder. They belong to different branches (filials) but have the same structure.
I would like to loop over them, load each one into R as a data frame, perform a group by, save everything in a single data frame, and export it as a single file. Would this be possible?
Based on several answers here, I did this:
# Load the data as different dataframes
library(tidyverse)
library(readxl)
f <- list.files(pattern="xlsx")
myfiles = lapply(f, read_excel)
for (i in 1:length(f)) assign(f[i], read_excel(f[i], sheet = "Deutsch", skip=7), data.frame(f[i]))
Now they are saved as separate data frames, but I don't know how to access them all together, so I manually created a list:
list_df = list(filialAA.xlsx, filialAB.xlsx,filianAC.xlsx,filianAD.xlsx,filianAE.xlsx...etc)
Then I wrote a group by to perform some calculations:
for (i in 1:length(list_df))
{
list_df[i] %>%
group_by(ABC) %>%
summarise(`Revenue in EUR` = sum(`Revenue in EUR`),
`Weight in KG` = sum(`Weight in KG`),
`Number of Materials` = length(`Materials`),
`Avg of deliveries` = mean(`Deliveries`))
}
If I run this on each data frame individually, it works, but inside this loop it does not. Could you help me loop over all the data frames, perform this group by, and gather the results in one single file? Is it possible?
Thanks a lot for your attention!
EDIT: Including a dummy data sample:
> dput(df1)
structure(list(Materials = c("11575358", "75378378", "21333333",
"02469984", "05465478", "05645648"), Deliveries = c(8, 1, 12,
5, 1, 1), ABC = c("C", "A", "C", "B", "C", "C"), `Revenue in EUR` = c(6179,
1804802.46, 3768.04, 9e+05, 1597.5, 1544.55), `Weight in KG` = c(16.6,
4.695625, 19, 9.14625, 2.74041666666667, 1.44208333333333)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
> dput(df2)
structure(list(Materials = c("48654798", "05465489", "04598496",
"08789453", "01589494", "06459849", "54694985", "65498848"),
Deliveries = c(24, 6, 32, 3, 11, 30, 45, 2), ABC = c("C",
"B", "C", "B", "C", "A", "A", "C"), `Revenue in EUR` = c(5509,
506978, 3978.04, 7e+05, 1597.5, 1200258, 2406975, 4059),
`Weight in KG` = c(29.6, 19, 24, 9.14625, 2.74041666666667,
50, 60, 10)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
The original Excel files are in xlsx format, with 5,000 to 15,000 rows, about 20 features (columns), and 7 tabs each. There are 22 Excel files to loop over.
OK, this may contain some errors because I don't have your files, but try something like this:
# first, write your data out as xlsx files. I use the xlsx package because
# I prefer it, but you should already have the files
xlsx::write.xlsx2(df1, "df1.xlsx")
xlsx::write.xlsx2(df2, "df2.xlsx")
library(tidyverse)
library(readxl)
# here you get all the xlsx files
f <- list.files(pattern = "xlsx")
f
[1] "df1.xlsx" "df2.xlsx"
# an empty list
listed <- list()
# loop that populates the empty list with your files
for (i in f) {
  listed[[i]] <- read_excel(i, sheet = "Sheet1" # , skip = 7
  )
  print(paste0("read the ", i, " file")) # here it says what it's doing
}
listed
$df1.xlsx
# A tibble: 6 x 6
  ...1  Materials Deliveries ABC   `Revenue in EUR` `Weight in KG`
  <chr> <chr>          <dbl> <chr>            <dbl>          <dbl>
1 1     11575358           8 C                6179           16.6
2 2     75378378           1 A             1804802.           4.70
3 3     21333333          12 C                3768.          19
4 4     02469984           5 B              900000            9.15
5 5     05465478           1 C                1598.           2.74
6 6     05645648           1 C                1545.           1.44
$df2.xlsx
# A tibble: 8 x 6
  ...1  Materials Deliveries ABC   `Revenue in EUR` `Weight in KG`
  <chr> <chr>          <dbl> <chr>            <dbl>          <dbl>
1 1     48654798          24 C                5509           29.6
2 2     05465489           6 B              506978           19
3 3     04598496          32 C                3978.          24
4 4     08789453           3 B              700000            9.15
5 5     01589494          11 C                1598.           2.74
6 6     06459849          30 A             1200258           50
7 7     54694985          45 A             2406975           60
8 8     65498848           2 C                4059           10
# now lapply the summary to each element of the list, creating a new list
list_result <- lapply(listed, function(x) x %>%
                        group_by(ABC) %>%
                        summarise(
                          `Revenue in EUR` = sum(`Revenue in EUR`),
                          `Weight in KG` = sum(`Weight in KG`),
                          `Number of Materials` = length(`Materials`),
                          `Avg of deliveries` = mean(`Deliveries`)))
# put the result in a data.frame
do.call(rbind, list_result)
# A tibble: 6 x 5
  ABC   `Revenue in EUR` `Weight in KG` `Number of Materials` `Avg of deliveries`
* <chr>            <dbl>          <dbl>                 <int>               <dbl>
1 A             1804802.           4.70                     1                 1
2 B              900000            9.15                     1                 5
3 C               13089.          39.8                      4                 5.5
4 A             3607233          110                        2                37.5
5 B             1206978           28.1                      2                 4.5
6 C               15144.          66.3                      4                17.2
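As for why the loop in the question does nothing useful: most likely list_df[i] (single brackets) hands group_by() a one-element list rather than a data frame, and the result of each iteration is never stored anywhere. Below is a minimal sketch of a corrected loop. It assumes two toy data frames standing in for the real files (the first three rows of df1 and df2, reduced to two columns), and the output file name is made up.

```r
library(dplyr)

# Toy stand-ins for the per-file data frames (first three rows of the
# question's df1 and df2, reduced to two columns)
list_df <- list(
  data.frame(ABC = c("C", "A", "C"), Deliveries = c(8, 1, 12)),
  data.frame(ABC = c("C", "B", "C"), Deliveries = c(24, 6, 32))
)

class(list_df[1])    # "list": single brackets return a sub-list
class(list_df[[1]])  # "data.frame": double brackets return the element

# Pre-allocate a list and store each summary instead of discarding it
results <- vector("list", length(list_df))
for (i in seq_along(list_df)) {
  results[[i]] <- list_df[[i]] %>%
    group_by(ABC) %>%
    summarise(`Avg of deliveries` = mean(Deliveries))
}

# Stack the summaries and export them as one file
combined <- bind_rows(results)
out_file <- file.path(tempdir(), "summary_all.csv")  # hypothetical file name
write.csv(combined, out_file, row.names = FALSE)
```

With the real data you would keep the full summarise() call from the question and point out_file at wherever the combined file should go.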
You may also use a suitable purrr::map variant, here map_dfr:
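library(tidyverse)
# .x is the current data frame; a bare `.` in front of %>% would build
# a function instead of piping the data
map_dfr(list_df, ~ .x %>%
          group_by(ABC) %>%
          summarise(`Revenue in EUR` = sum(`Revenue in EUR`),
                    `Weight in KG` = sum(`Weight in KG`),
                    `Number of Materials` = length(`Materials`),
                    `Avg of deliveries` = mean(`Deliveries`)))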
It row-binds the results in the same step, so you get a single data frame back.
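One thing the row-bound result loses is which file each row came from. A possible way to keep that, sketched here under the assumption that the list is named (the names filialAA and filialAB are placeholders), is the .id argument of map_dfr. The toy data are the ABC and Deliveries columns of the question's df1 and df2.

```r
library(dplyr)
library(purrr)

# ABC and Deliveries columns from the question's sample data
df1 <- tibble(ABC = c("C", "A", "C", "B", "C", "C"),
              Deliveries = c(8, 1, 12, 5, 1, 1))
df2 <- tibble(ABC = c("C", "B", "C", "B", "C", "A", "A", "C"),
              Deliveries = c(24, 6, 32, 3, 11, 30, 45, 2))

# The list names are placeholders; they become the values of the id column
list_df <- list(filialAA = df1, filialAB = df2)

summary_df <- map_dfr(
  list_df,
  ~ .x %>%
    group_by(ABC) %>%
    summarise(`Number of Materials` = n(),
              `Avg of deliveries` = mean(Deliveries)),
  .id = "filial"  # extra column holding the list name of each element
)
```

In purrr 1.0.0 and later, map_dfr is marked as superseded; map() followed by list_rbind(names_to = "filial") should give an equivalent result.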
Even after storing the files in myfiles, you can use the following syntax:
library(janitor)
map_dfr(myfiles, ~ .x[-c(1:5), ] %>%
          row_to_names(1) %>%
          group_by(ABC) %>%
          summarise(`Revenue in EUR` = sum(as.numeric(`Revenue in EUR`)),
                    `Weight in KG` = sum(as.numeric(`Weight in KG`)),
                    `Number of Materials` = length(`Materials`),
                    `Avg of deliveries` = mean(as.numeric(`Deliveries`))) %>%
          ungroup())
Results with your given files:
# A tibble: 6 x 5
  ABC   `Revenue in EUR` `Weight in KG` `Number of Materials` `Avg of deliveries`
  <chr>            <dbl>          <dbl>                 <int>               <dbl>
1 A             1804802.           4.70                     1                 1
2 B              900000            9.15                     1                 5
3 C               13089.          39.8                      4                 5.5
4 A             3607233          110                        2                37.5
5 B             1206978           28.1                      2                 4.5
6 C               15144.          66.3                      4                17.2
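The as.numeric() calls in the myfiles version are needed because a sheet read without skip carries its real header as a data row, so every column arrives as text. The toy example below illustrates what janitor::row_to_names does in that situation. The data frame, including the junk first row, is invented for illustration and only mimics the shape of such a sheet.

```r
library(janitor)

# Invented stand-in for a sheet read without skip: one junk row,
# then the real header row, then the data
raw <- data.frame(
  X1 = c("Report 2021", "ABC", "C", "A"),
  X2 = c(NA, "Deliveries", "8", "1"),
  stringsAsFactors = FALSE
)

# Promote row 2 to column names; by default that row and the rows
# above it are dropped from the result
clean <- row_to_names(raw, row_number = 2)

names(clean)                                         # "ABC" "Deliveries"
avg_deliveries <- mean(as.numeric(clean$Deliveries)) # convert the text first
```

Passing skip to read_excel when loading the files, as in the question's assign() loop, avoids this clean-up step altogether.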
I like to write functions, so I would do it like this (it is longer, but it gives you a more stable setup to modify and debug when required).
library(readxl)
library(dplyr)
library(stringr)

# Main Function
main_function <- function(import, name) {
  path <- main_function.create_path()
  output <- main_function.create_output()
  for (file in list.files(path)) {
    # skip anything that is not an Excel workbook
    if (!str_detect(file, "\\.xlsx$")) {
      next
    }
    data <- read_excel(file.path(path, file), sheet = "Deutsch", skip = 7)
    data.values <- main_function.calculate_values(data)
    output <- main_function.append_values(file, data.values, output)
  }
  main_function.export(path, output, name)
  if (import) {
    assign("values", output, envir = .GlobalEnv)
  }
}

# Functions
main_function.export <- function(path, output, name) {
  write.csv(output, file = file.path(path, paste0(name, ".csv")),
            row.names = FALSE)
}

main_function.append_values <- function(file, data.values, output) {
  # Put the name of the file, without the .xlsx at the end, in the first
  # column of the summary rows, then append those rows to the output
  data.values <- mutate(data.values,
                        file = str_extract(file, ".+(?=\\.xlsx$)"),
                        .before = 1)
  return(bind_rows(output, data.values))
}

main_function.calculate_values <- function(data) {
  data <- data %>%
    group_by(ABC) %>%
    summarize(`Revenue in EUR` = sum(`Revenue in EUR`, na.rm = TRUE)
              # .... (remaining summaries, as in the question)
    )
  return(data)
}

main_function.create_path <- function() {
  path <- '<path to files>'
  return(path)
}

main_function.create_output <- function() {
  # start empty; the summary rows of each file are appended in the loop
  output <- data.frame()
  return(output)
}
This creates main_function which, when called, cycles through all of the files in the given path, reads each one, processes it, and appends the result to output, which is then saved in the same path under the name you give it. If you set import to TRUE, it will also assign the output to a variable called values in your global environment.