
Automate a series of actions done on a single .csv file to all .csv files within the same directory in R

I'm working on a research project where I need to process data from a pair of tactile gloves. After exporting the data, there are 4 rows containing date and time that I don't need for later analysis, and there are also many columns I don't need. Long story short, I need to delete the first 4 rows and keep only columns [1,2,33,53,76,95,114,133,164,184,207,226,245]. I wrote a pretty simple R script to do it for me, but I'm wondering how I can apply this set of operations to all .csv files in the same directory? Manually typing each file name every time is pretty painful. Thank you in advance!

# read uncleaned, raw, data
uncleaned_data<-read.csv("C:/Users/jiang/Desktop/Ready_Clean/Hongjiao_Medium_High1.csv", header = FALSE)

# remove the date and time headers
data_without_head<-uncleaned_data[-c(1,2,3,4),]

# extract the useful columns
cleaned_data<-data_without_head[,c(1,2,33,53,76,95,114,133,164,184,207,226,245)]

# write the new cleaned data into a new file name (adding "_cleaned" in the end)
write.table(cleaned_data,"C:/Users/jiang/Desktop/Ready_Clean/Hongjiao_Medium_High1_Cleaned.csv",row.names=FALSE,col.names=FALSE,sep=",")

You can list all the files in the directory and then filter the ones ending with .csv:

I assumed that your directory path is "C:/Users/jiang/Desktop/Ready_Clean/".

Unfortunately I can't test the code on my PC, but let me know if you have any questions.

library(tidyverse)
library(stringr)

#get all the .csv files present in the directory, then build the new names by inserting '_cleaned' before .csv

paths <- list.files(path = "C:/Users/jiang/Desktop/Ready_Clean/") %>%
          str_subset(pattern = '\\.csv$') #capture all the files ending in .csv (escape the dot so it matches literally)


paths <- str_c("C:/Users/jiang/Desktop/Ready_Clean/", paths)


paths_cleaned <- str_replace(paths, '\\.csv$', '_cleaned.csv')

get_csv <- function(path, path_clean){
    # read uncleaned, raw, data
    uncleaned_data    <- read.csv(path, header = FALSE)
    
    # remove the date and time headers
    data_without_head <- uncleaned_data[-c(1,2,3,4),]
    
    # extract the useful columns
    cleaned_data      <- data_without_head[, c(1,2,33,53,76,95,114,133,164,184,207,226,245)]
    
    # write the new cleaned data into a new file name (adding "_cleaned" in the end)
    write.table(cleaned_data,
                path_clean,
                row.names = FALSE,
                col.names = FALSE,
                sep = ",")
}

#walk2 would also be an option because we only care about side effects here.
map2(paths, paths_cleaned, ~get_csv(.x, .y))
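
As the comment notes, `walk2()` (also from purrr) is the more idiomatic choice when only the side effects matter: it has the same signature as `map2()` but discards return values. A minimal self-contained sketch, with `cat()` standing in for `get_csv()`:

```r
library(purrr)

# toy demonstration: walk2() pairs elements from the two vectors and
# calls the function once per pair, purely for its side effects,
# returning its input invisibly instead of building a result list
walk2(c("in1.csv", "in2.csv"),
      c("in1_cleaned.csv", "in2_cleaned.csv"),
      function(src, dest) cat("would clean", src, "->", dest, "\n"))

# the real call would simply be:
# walk2(paths, paths_cleaned, get_csv)
```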

A base R solution looks like this. First, we use list.files() to extract files ending with .csv, then use the file list to drive lapply() to read the data, subset it, and write it with write.table().

theFiles <- list.files(path="C:/Users/jiang/Desktop/Ready_Clean/",
                       pattern="\\.csv$",full.names=TRUE)
dataList <- lapply(theFiles,function(x){
     y <- read.csv(x,skip = 4,header=FALSE)[c(1,2,33,53,76,95,114,133,164,184,207,226,245)]
     write.table(y,paste0(x,".cleaned"),row.names=FALSE,col.names=FALSE,sep=",")
})

Note that we use the skip = argument to skip the first four rows when reading each file, then immediately subset the object created by read.csv() via the [ form of the extract operator.

In the write.table() operation we use paste0() to append .cleaned to each original file name to distinguish the cleaned files from the originals.
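
The skip-and-subset pattern is easy to verify on a small synthetic file (the file contents below are invented purely for illustration):

```r
# build a tiny 6-line csv in a temporary file: four junk header lines
# followed by two data rows
tmp <- tempfile(fileext = ".csv")
writeLines(c("h1,x,y", "h2,x,y", "h3,x,y", "h4,x,y",
             "1,a,10", "2,b,20"), tmp)

# skip = 4 discards the four junk lines; the [ subset keeps columns 1 and 3
y <- read.csv(tmp, skip = 4, header = FALSE)[c(1, 3)]
print(y)
#   V1 V3
# 1  1 10
# 2  2 20
```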

Since the original question does not include a minimal reproducible example, we'll use the data from my Pokémon Stats GitHub repository to illustrate the solution.

The dimensionality of the Pokémon stats data is much different from the data described in the original question, so we'll skip the first four rows of each file, and retain only columns 1, 2, 4, and 6.

download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonData.zip",
              "pokemonData.zip",mode="wb")
unzip("pokemonData.zip",exdir="./pokemonData")


theFiles <- list.files("./pokemonData",pattern="\\.csv$",full.names=TRUE)
dataList <- lapply(theFiles,function(x){
     y <- read.csv(x,skip = 4,header=FALSE)[c(1,2,4,6)]
     write.table(y,file=paste0(x,".cleaned"),row.names=FALSE,col.names=FALSE,sep=",")
})

A screenshot of one of the original files can be used to verify the output. I have highlighted columns 1, 2, 4, and 6, beginning with the fifth row of input (the first four rows, including the header row, are skipped).

[screenshot of the original file with columns 1, 2, 4, and 6 highlighted]

...and the output for the first few rows of ./pokemonData/gen01.csv.cleaned is:

4,"Charmander","Fire",309
5,"Charmeleon","Fire",405
6,"Charizard","Fire",534
7,"Squirtle","Water",314
8,"Wartortle","Water",405
9,"Blastoise","Water",530

The file gen01.csv contains the first generation Pokémon. The first three Pokémon in this file are Bulbasaur, Ivysaur, and Venusaur. We can see from the output that these Pokémon and the header row in the original file were skipped, so the first observation is Pokémon 4, Charmander. We also see that the Total stat, the sixth column, matches the input file for the rows that have been written to the output file.

Validating the written files

Because we appended .cleaned to the end of each file name, we can use the same technique to list the .cleaned files as we did to list the .csv files, and read them with read.csv(). This allows us to keep the original files distinct from the cleaned files.

# now read the cleaned files
theFiles <- list.files("./pokemonData",pattern="\\.cleaned$",full.names=TRUE)
dataList <- lapply(theFiles,read.csv,header=FALSE)
head(dataList[[1]])

At this point the dataList object is a list() that contains 8 data frames, one for each generation of Pokémon.

We use head() to print the first few rows of the first data frame in the list, which matches the results above:

> head(dataList[[1]])
  V1         V2    V3  V4
1  4 Charmander  Fire 309
2  5 Charmeleon  Fire 405
3  6  Charizard  Fire 534
4  7   Squirtle Water 314
5  8  Wartortle Water 405
6  9  Blastoise Water 530

Writing the cleaned files to a separate directory

Per the request made in the comments to my answer, here is a solution that creates a /cleaned subdirectory within the directory where the files were originally stored, and writes the files to that directory.

First, we create objects for the input and output directories. Then we create a new subdirectory for the output files if it does not already exist.

# solution that creates a ./cleaned subdirectory

inputDirectory <- "./pokemonData"
outputDirectory <- paste0(inputDirectory,"/cleaned")
if(!dir.exists(outputDirectory)) dir.create(outputDirectory)

By checking whether the directory exists before attempting to create it, we eliminate errors on the second and subsequent runs of this script.

Next, we list the files in the input directory. Because we're going to use the inputDirectory and outputDirectory objects later in the script to manually build the full path names for each input and output file, we set the full.names= argument of list.files() to FALSE.

theFiles <- list.files(inputDirectory,pattern="\\.csv$",full.names=FALSE)
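
As an aside, file.path() is an alternative to paste0() for joining directory and file names; a small sketch, reusing the directory names from above:

```r
inputDirectory  <- "./pokemonData"
outputDirectory <- file.path(inputDirectory, "cleaned")

# file.path() inserts the path separator for us, so we avoid
# hand-writing "/" in every paste0() call
file.path(outputDirectory, "gen01.csv")
# [1] "./pokemonData/cleaned/gen01.csv"
```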

Next, we use lapply() to read the files, subset the right rows and columns, and write the cleaned files to the output directory.

dataList <- lapply(theFiles,function(x){
     y <- read.csv(paste0(inputDirectory,"/",x),skip = 4,header=FALSE)[c(1,2,4,6)]
     write.table(y,file=paste0(outputDirectory,"/",x),row.names=FALSE,col.names=FALSE,sep=",")
})

# verify that files were written to cleaned directory
list.files(outputDirectory,full.names=TRUE)

...and the output:

> list.files(outputDirectory,full.names=TRUE)
[1] "./pokemonData/cleaned/gen01.csv" "./pokemonData/cleaned/gen02.csv"
[3] "./pokemonData/cleaned/gen03.csv" "./pokemonData/cleaned/gen04.csv"
[5] "./pokemonData/cleaned/gen05.csv" "./pokemonData/cleaned/gen06.csv"
[7] "./pokemonData/cleaned/gen07.csv" "./pokemonData/cleaned/gen08.csv"
>

Appendix

Since commenters asserted that the dots in the file names built with paste0() weren't being handled correctly, the following screenshot of the subdirectory demonstrates that the code does indeed work as intended.

[screenshot of the ./pokemonData/cleaned subdirectory]

Hi, I did some coding for you to answer your question.

  1. First, set the working directory.
  2. List all the files you need to process. My assumption here is that all files start with "Hongjiao_Medium_High" followed by some number.
  3. Use a for loop to iterate over the list of file names.
  4. Paste your code inside the for loop, with some tweaks.

Below is the code:

setwd("C:/Users/jiang/Desktop/Ready_Clean")
list_of_file_names <- list.files(pattern = "^Hongjiao_Medium_High.*\\.csv$")

for(i in list_of_file_names){
  # read uncleaned, raw, data
  print(i)
  uncleaned_data<-read.csv( i , header = FALSE)
  
  # remove the date and time headers
  data_without_head<-uncleaned_data[-c(1,2,3,4),]
  
  # extract the useful columns
  cleaned_data<-data_without_head[,c(1,2,33,53,76,95,114,133,164,184,207,226,245)]
  
  # write the new cleaned data into a new file name (adding "_cleaned" in the end)
  write.table(cleaned_data,sub("\\.csv$","_Cleaned.csv",i),row.names=FALSE,col.names=FALSE,sep=",")
}
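
Note that paste() inserts a space between its arguments by default, so it is safer to build the output name with sub(), which swaps the .csv extension for _Cleaned.csv. A quick sketch:

```r
# sub() replaces the trailing ".csv" with "_Cleaned.csv", keeping the stem:
# "Hongjiao_Medium_High1.csv" becomes "Hongjiao_Medium_High1_Cleaned.csv"
sub("\\.csv$", "_Cleaned.csv", "Hongjiao_Medium_High1.csv")
# [1] "Hongjiao_Medium_High1_Cleaned.csv"

# paste() with its default separator would instead produce
# "Hongjiao_Medium_High1.csv _Cleaned.csv" (note the space)
paste("Hongjiao_Medium_High1.csv", "_Cleaned.csv")
```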
