
Import a folder with multiple .csv files and manipulate all data frames at once in R

I have a folder with 100 different .csv files. Not all files contain the same number of variables (they have different structures), so I am trying to import them all at once, creating a separate data frame for each csv. I then want to standardize the data frames, for example by adding a new column or converting the date column from character to Date, and export them all again at the end. Here is my attempt; it works and reads every csv into a separate data frame:

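# Folder containing the csv files (trailing slash needed for paste() below)
path <- "C:/Users/.../"
files <- list.files(path, pattern = "\\.csv$")
for (file in files)
{
  # Position of the "." that separates the file name from its extension
  perpos <- which(strsplit(file, "")[[1]] == ".")
  assign(
    gsub(" ", "", substr(file, 1, perpos - 1)),
    read.csv(paste(path, file, sep = "")))
}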

However, when I add mutate inside the assign call, for instance to add a new column, the script runs but no column is added! What am I missing here? My aim is to add or manipulate some variables and export the data frames again, preferably within the tidyverse.

for(file in files)
{
  perpos <- which(strsplit(file, "")[[1]]==".")
  assign(
    gsub(" ","",substr(file, 1, perpos-1)), 
    read_csv(paste(path,file,sep="")),
    mutate(. , Heading = "Data"))
} 

Example

df1 <- structure(list(datadate = structure(c(17927, 17927, 17927, 17927, 
17927, 17927), class = "Date"), parent = c("grup", "grup", 
"grup", "grup", "grup", "grup"), ads = c("P9", 
"PS8", "PS7", "PS6", "PS5", "PS5"), chl = c("PSS9", 
"PSS8", "PSS7", "PSS6", "PSS5", "PSS5"), 
    average_monthly = c(196586.49, 289829.43, 
    1363529.14, 380446.43, 147296.09, 948669.38), current_month = c(987118.82, 
    1682872.03, 4356755.73, 2225040.29, 922506.21, 5756525.08
    ), current_month_minus_1 = c(585573.1, 
    635763.37, 6551477.37, 818531.11, 255862.51, 1832829.99), 
    current_month_minus_2 = c(0, 0, 0, 0, 0, 0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L))

df2<-
  structure(
    list(
      network = c("STAR", "NPD", "GMD"),
      datadate = structure(c(18259, 18259, 18259)),
      brand = c("grup", "GFK", "MDG"),
      average_weekly = c(140389.14, 10281188.25, 172017.39),
      last_week_avg = c(89303.07, 6918460.99, 110594.64),
      last_week_1_minus_avg = c(141765.83, 10248501.1, 222484.9),
      last_week_2_minus_avg = c(138043.53, 9846538.57, 164185.21)
    ),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA, -3L)
  )

Here is a base R solution that reads the files into a list. The changes required to merge them depend on your data:

# Store a scalar of the path containing the csvs: 

example_dir <- "C:/Users/Example_Dir"

# Create a vector of the csv paths: 

files <- list.files(example_dir, pattern = "\\.csv$", full.names = TRUE)

# Create an empty list the same length as the number of files: 

X <- vector("list", length(files))

# Iterate through the files and store them in a list:

X[] <- lapply(files, function(f) {
  # stringsAsFactors is an argument of read.csv, so it is passed there
  read.csv(f, stringsAsFactors = FALSE)
})
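Once the files are in a list, each data frame can be standardized with another lapply and then written back out. The following is only a minimal sketch under stated assumptions: the temporary folder, the two demo files (a.csv, b.csv), the Heading column value, and the out folder are all made up for illustration, and the datadate column is assumed to be stored as text in ISO format (yyyy-mm-dd).

```r
# Hypothetical demo folder with two small csv files of different structure
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(datadate = "2019-01-31", ads = "P9"),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(network = "STAR", datadate = "2019-12-29", brand = "grup"),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

# Read every csv into a list and name each element after its file
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
X <- lapply(files, read.csv, stringsAsFactors = FALSE)
names(X) <- tools::file_path_sans_ext(basename(files))

# Standardize: add a column and convert the date column where it exists
X <- lapply(X, function(df) {
  df$Heading <- "Data"
  if ("datadate" %in% names(df)) df$datadate <- as.Date(df$datadate)
  df
})

# Export each data frame under its original name into a separate folder
out_dir <- file.path(demo_dir, "out")
dir.create(out_dir, showWarnings = FALSE)
for (nm in names(X)) {
  write.csv(X[[nm]], file.path(out_dir, paste0(nm, ".csv")), row.names = FALSE)
}
```

Naming the list elements after the files keeps the link between each data frame and its source, which is what the assign approach was trying to achieve with separate variables.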

Aside from the design of your code, it seems that you are using mutate the wrong way.

In your code, you pass the mutate call as the third argument of the assign function, which is pos: the position, that is, the environment in which the variable is created. The mutate call is therefore not part of the value being assigned.
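To see the argument order for yourself, you can inspect the formals of assign. This is just a small illustrative sketch; the environment e and the variable name df_a are hypothetical.

```r
# List the argument names of base::assign in positional order
names(formals(assign))

# Passing the target environment by name avoids any positional confusion
e <- new.env()
assign("df_a", data.frame(x = 1:3), envir = e)
ls(e)
```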

What you'd really want to write is this:

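# Requires library(readr) and library(dplyr)
for (file in files) {
  perpos <- which(strsplit(file, "")[[1]] == ".")
  assign(
    gsub(" ", "", substr(file, 1, perpos - 1)),
    read_csv(paste(path, file, sep = "")) %>%
      mutate(Heading = "Data"))
}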

If you are not familiar with the pipe operator (%>%), I suggest reading a tutorial such as the dplyr vignette, which has a paragraph about it.

This code means: read the data frame from the csv, mutate it to add the Heading column, and assign the result to a variable whose name is produced by the gsub call.

But, as in hello_friend's answer, I urge you to rethink your design and work with a list rather than a bunch of separate variables. For this, the tidyverse way is to use the purrr package.
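A list-based version with purrr could look roughly like the sketch below. It is only an illustration under assumptions: the temporary folder, the demo files a.csv and b.csv, and the out folder are invented here, and you would replace the mutate step with whatever standardization your files need. read_csv may print column specification messages while it runs.

```r
library(purrr)
library(readr)
library(dplyr)

# Hypothetical demo folder with two small csv files of different structure
demo_dir <- file.path(tempdir(), "purrr_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(tibble(datadate = "2019-01-31", ads = "P9"), file.path(demo_dir, "a.csv"))
write_csv(tibble(network = "STAR", brand = "grup"), file.path(demo_dir, "b.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Named list of tibbles: one element per file, named after the file
dfs <- files %>%
  set_names(tools::file_path_sans_ext(basename(files))) %>%
  map(read_csv) %>%
  map(~ mutate(.x, Heading = "Data"))

# Write every tibble back out; .x is the tibble and .y is its name
out_dir <- file.path(demo_dir, "out")
dir.create(out_dir, showWarnings = FALSE)
iwalk(dfs, ~ write_csv(.x, file.path(out_dir, paste0(.y, ".csv"))))
```

Individual data frames remain accessible by name, for example dfs$a, so nothing is lost compared with creating one variable per file.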

