
How to create one dataframe from multiple csv files in a folder

I have a set of CSV files (A1.csv, A2.csv, ..., D10.csv) in a folder. Each file contains two columns and several rows. I want to extract the value in the last row of the second column from every file (see the picture) and create a data frame that holds the file name in the first column and the extracted value (C) in the second column.

At the moment I do this by writing one intermediate CSV file per input file and then concatenating those files into a single data frame.

Is it possible to store the data frame produced from each CSV file in a list and then concatenate them (what rbind does in R)? I tried the code below in R and it works, but I would like to learn a more efficient way in R or Python. (Python is preferable, as I am trying to learn it.)

# Read each CSV file and write its last-row, second-column value to Filename/
# (the Filename folder must already exist)
f <- list.files(path = getwd(), pattern = "\\.csv$")
for (g in f) {
  aa <- read.csv(g)
  m <- tail(aa, 1)
  q <- m[, 2]
  yy <- data.frame(ID = g, Final = q)
  write.csv(yy, file = file.path("Filename", g), row.names = FALSE)
}
# Concatenate the intermediate files into one file
readFile <- list.files(path = "Filename", pattern = "\\.csv$", full.names = TRUE)
Alldata <- lapply(readFile, read.csv)
FinalFIle <- do.call(rbind, Alldata)
write.csv(FinalFIle, file = "FinalFIle.csv", row.names = FALSE)

Here is an option in R.

Step 1: Prepare a vector with the file names. If there are many files in the folder, the list.files function could build this vector for you. Here, I just created it manually. I also assume that all the files are stored in the working directory; otherwise, you will need to construct the full file paths.

file_vec <- c("A1.csv", "A2.csv", "A3.csv")

Step 2: Read all the CSV files listed in file_vec. The key is to use the lapply function to apply read.csv to every element of file_vec.

dt_list <- lapply(file_vec, read.csv, stringsAsFactors = FALSE)

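Step 3: Prepare a vector of the file names without the .csv extension. The dot is escaped and the pattern is anchored with $ so that only a trailing ".csv" is removed.

name_vec <- sub("\\.csv$", "", file_vec)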

Step 4: Create the data frame. x[nrow(x), 2] is a way to access the last value of the second column.

dt_final <- data.frame(File = name_vec,
                       Value = sapply(dt_list, function(x) x[nrow(x), 2]),
                       stringsAsFactors = FALSE)

dt_final is the final output.
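
Since the question prefers Python, the four steps above can be sketched in pandas roughly as follows. This is a minimal sketch and not part of the original answer: the function name `last_values` is made up for illustration, and it assumes every file has a header row, at least one data row, and at least two columns.

```python
from pathlib import Path

import pandas as pd


def last_values(file_vec):
    """Return one row per file: the file name and the last value of column 2."""
    # Step 2: read every CSV into a list of data frames (like lapply + read.csv).
    dt_list = [pd.read_csv(f) for f in file_vec]
    # Step 3: file names without the .csv extension.
    name_vec = [Path(f).stem for f in file_vec]
    # Step 4: iloc[-1, 1] is the pandas counterpart of x[nrow(x), 2].
    values = [df.iloc[-1, 1] for df in dt_list]
    return pd.DataFrame({"File": name_vec, "Value": values})
```

Calling `last_values(["A1.csv", "A2.csv", "A3.csv"])` would then play the role of `dt_final`, provided those files exist in the working directory.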

Here's another option using the tidyverse in R:

library(tidyverse)

# In my example, I'm using a folder with 4 Chicago Crime Datasets
setwd("INSERT/PATH/HERE")

files <- list.files()

# The column is named filename so that it matches the output shown below
tibble(filename = files) %>%
  mutate(file_contents = map(filename, ~ read_csv(.x, n_max = 10))) %>%
  unnest(file_contents) %>%
  group_by(filename) %>%
  slice(n()) %>%
  select(1:2)

Which returns:

# A tibble: 4 x 2
# Groups:   filename [4]
                         filename    X1
                            <chr> <int>
1 Chicago_Crimes_2001_to_2004.csv  4904
2 Chicago_Crimes_2005_to_2007.csv    10
3 Chicago_Crimes_2008_to_2011.csv  5867
4 Chicago_Crimes_2012_to_2017.csv  1891

Note that the n_max = 10 argument isn't needed. I only included this because the files I was working with are pretty large.

For anyone interested, the dataset can be found here.

Also, you may want to avoid setting the working directory with setwd(). If this is the case, you can pass the additional argument full.names = TRUE to list.files():

path <- "INSERT/PATH/HERE"
files <- list.files(path, full.names = TRUE)

I'd recommend this approach: scripts that call setwd() aren't flexible, because paths change from user to user.
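
The same idea carries over to Python: build full paths up front rather than changing the working directory. A minimal sketch, assuming the files end in a lower-case `.csv`; `list_csv_files` is a hypothetical helper name, not something from the answer above.

```python
from pathlib import Path


def list_csv_files(folder):
    """Return the full paths of all .csv files directly inside `folder`, sorted."""
    # Path.glob yields paths that already include `folder`, much like
    # list.files(path, full.names = TRUE), so no chdir is needed.
    return sorted(Path(folder).glob("*.csv"))
```

Each returned path can be handed straight to a CSV reader, whatever the current working directory happens to be.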

Python Solution

>>> import pandas as pd
>>> files = ['A1.csv', 'A2.csv', ... , 'D10.csv']
>>> df_final = pd.DataFrame({'File': files, 'Value': [pd.read_csv(fname).iat[-1, 1] for fname in files]})
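
The question also asks whether each per-file result can be kept in a list and combined at the end, as `rbind` does in R. In pandas that role is played by `pd.concat`. The sketch below is one way to do it under a few assumptions: the files are found with a `*.csv` glob in a folder you pass in, each file has a header row and at least two columns, and the names `collect_last_rows`, `ID`, and `Final` (the latter two borrowed from the question's R code) are only illustrative.

```python
from pathlib import Path

import pandas as pd


def collect_last_rows(folder="."):
    """Build one data frame with the last-row, second-column value of each CSV."""
    pieces = []
    for path in sorted(Path(folder).glob("*.csv")):
        df = pd.read_csv(path)
        # Keep only the last row and the second column, as a one-row frame.
        last = df.iloc[[-1], [1]].copy()
        last.columns = ["Final"]
        # Record which file the value came from, as the first column.
        last.insert(0, "ID", path.name)
        pieces.append(last)
    # pd.concat stacks the one-row frames, like do.call(rbind, ...) in R.
    return pd.concat(pieces, ignore_index=True)
```

Note that `pd.concat` raises a `ValueError` when the list is empty, so a folder without any CSV files needs to be handled separately.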

This is an easy case for bash and friends. This one-liner

for i in A*.csv B*.csv C*.csv D*.csv; do awk -F , 'END{ print $NF }' "$i"; done

extracts the bottom-right field of each file, no matter how many rows or columns it has, for any number of files that match the pattern you have given. If all the files are in one folder, they are the only .csv files in that folder, and you want to save the outcome in a new file, this would do the job:

for i in *.csv; do awk -F , 'END{ print $NF }' "$i"; done > extract.txt
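
The one-liners above print only the values, without the file names the question asks for. A rough Python counterpart that keeps the name next to each value is sketched here. It uses only the standard library and reads each file line by line; `last_field` and `extract_last_fields` are hypothetical names, and the sketch assumes plain comma-separated files with at least one non-empty line.

```python
import csv
from collections import deque
from pathlib import Path


def last_field(path):
    """Return the last field of the last non-empty row of a CSV file."""
    with open(path, newline="") as handle:
        # deque(maxlen=1) keeps only the final row while streaming the file.
        rows = deque((row for row in csv.reader(handle) if row), maxlen=1)
    return rows[0][-1]


def extract_last_fields(folder="."):
    """Map each .csv file name in `folder` to its bottom-right field (a string)."""
    return {p.name: last_field(p) for p in sorted(Path(folder).glob("*.csv"))}
```

The values come back as strings, as they do from awk, so convert them with `float()` or `int()` if numbers are needed.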
