How to create one dataframe from multiple CSV files in a folder
I have a set of CSV files (A1.csv, A2.csv, ..., D10.csv) in a folder. Each file contains two columns and several rows. Basically, I want to extract the value in the last row of the second column from every CSV file (see the picture to understand better) and create a data frame that holds the file name in the first column and the extracted value (C) in the second column.
Right now I do this by writing another set of CSV files and concatenating them into one data frame afterwards. Is it possible to store the data frame produced from each CSV file in a list and then concatenate them (what rbind does in R)? I tried the code below in R and it works, but I want to learn a more efficient way in R or Python. (Python is preferable, as I am trying to learn Python.)
# Read each CSV file and keep the value in the last row of the 2nd column
f = list.files(path = getwd(), pattern = "\\.csv$")
for (g in f) {
  aa = read.csv(g)
  m = tail(aa, 1)
  q = m[, 2]
  yy = data.frame(ID = g, Final = q)
  # The output folder "Filename" must already exist
  write.csv(yy, file = file.path("Filename", g), row.names = FALSE)
}
# Concatenate the per-file results into one file
readFile = list.files(path = "Filename", pattern = "\\.csv$", full.names = TRUE)
Alldata = lapply(readFile, read.csv)
FinalFile = do.call(rbind, Alldata)
write.csv(FinalFile, file = "FinalFile.csv", row.names = FALSE)
Here is an option in R.
Step 1: Prepare a vector with the file names. If there are many files in the folder, the list.files function could be useful. Here, I just created it manually. I also assume that all the files are stored in the working directory. Otherwise, you will need to construct the file paths.
file_vec <- c("A1.csv", "A2.csv", "A3.csv")
Step 2: Read all CSV files based on file_vec. The key is to use the lapply function to apply read.csv to every element of file_vec.
dt_list <- lapply(file_vec, read.csv, stringsAsFactors = FALSE)
Step 3: Prepare a vector showing the file names without .csv
# Escape the dot and anchor at the end so only the extension is removed
name_vec <- sub("\\.csv$", "", file_vec)
Step 4: Create the data frame. x[nrow(x), 2] is a way to access the last value of the second column.
dt_final <- data.frame(File = name_vec,
Value = sapply(dt_list, function(x) x[nrow(x), 2]),
stringsAsFactors = FALSE)
dt_final is the final output.
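For readers following along in Python, the four steps above can be sketched with pandas. This is a minimal sketch, not a definitive implementation: the function name last_values is hypothetical, and the column names File and Value simply mirror the R example. It assumes every file has at least two columns and at least one data row.

```python
import os

import pandas as pd


def last_values(file_vec):
    """Build a DataFrame with one row per file: the file name without its
    extension and the value in the last row of the second column.

    `last_values` is a hypothetical helper name used for illustration.
    """
    rows = []
    for fname in file_vec:
        df = pd.read_csv(fname)  # Step 2: read the file
        # Step 3: drop the directory part and the ".csv" extension
        name = os.path.splitext(os.path.basename(fname))[0]
        # Step 4: .iat[-1, 1] is the last row of the second column
        rows.append({"File": name, "Value": df.iat[-1, 1]})
    return pd.DataFrame(rows, columns=["File", "Value"])
```

Calling last_values(["A1.csv", "A2.csv", "A3.csv"]) would return a frame shaped like dt_final, provided those files exist in the working directory.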
Here's another option using the tidyverse in R:
library(tidyverse)
# In my example, I'm using a folder with 4 Chicago Crime Datasets
setwd("INSERT/PATH/HERE")
files <- list.files()
tibble(filename = files) %>%
  # Read each file into a list-column (n_max only limits rows for large files)
  mutate(file_contents = map(filename, ~ read_csv(.x, n_max = 10))) %>%
  unnest(file_contents) %>%
  group_by(filename) %>%
  # Keep the last row of each file, then the first two columns
  slice(n()) %>%
  select(1:2)
Which returns:
# A tibble: 4 x 2
# Groups:   filename [4]
  filename                           X1
  <chr>                           <int>
1 Chicago_Crimes_2001_to_2004.csv  4904
2 Chicago_Crimes_2005_to_2007.csv    10
3 Chicago_Crimes_2008_to_2011.csv  5867
4 Chicago_Crimes_2012_to_2017.csv  1891
Note that the n_max = 10 argument isn't needed. I only included it because the files I was working with are pretty large.
For anyone interested, the dataset can be found here.
Also, you may want to avoid setting the working directory with setwd(). If this is the case, you can use the additional argument full.names = TRUE in list.files():
path <- "INSERT/PATH/HERE"
files <- list.files(path, full.names = TRUE)
I'd recommend this approach because scripts containing setwd() aren't flexible: paths will change from user to user.
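The same idea carries over to Python: rather than changing the working directory, collect full paths up front. The sketch below uses only the standard library. The helper name list_csv_files is hypothetical, and the folder argument stands in for whatever path applies on your machine.

```python
from pathlib import Path


def list_csv_files(folder):
    """Return the full paths of all .csv files directly inside `folder`,
    sorted, without changing the current working directory.

    `list_csv_files` is a hypothetical helper name used for illustration.
    """
    return sorted(str(p) for p in Path(folder).glob("*.csv"))
```

The returned paths can be passed straight to a CSV reader, so the script keeps working no matter where it is launched from.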
Python Solution
>>> import pandas as pd
>>> files = ['A1.csv', 'A2.csv', ... , 'D10.csv']
>>> # .iat[-1, 1] is the last row of the second column
>>> df_final = pd.DataFrame(
...     [(fname, pd.read_csv(fname).iat[-1, 1]) for fname in files],
...     columns=['File', 'Value'])
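The question also asks whether each per-file data frame can be kept in a list and concatenated at the end, the way rbind works in R. In pandas that role is played by pd.concat. The following is a minimal sketch under stated assumptions: combine_last_rows is a hypothetical name, the ID and Final column names are borrowed from the R code in the question, and each file is assumed to have at least two columns and one data row.

```python
import pandas as pd


def combine_last_rows(files):
    """Collect one single-row DataFrame per file in a list, then stack
    them with pd.concat (the pandas counterpart of do.call(rbind, ...)).

    `combine_last_rows` is a hypothetical helper name used for illustration.
    """
    frames = []
    for fname in files:
        df = pd.read_csv(fname)
        # Keep the last row and the second column as a 1x1 DataFrame
        last = df.iloc[[-1], [1]].copy()
        last.columns = ["Final"]
        last.insert(0, "ID", fname)  # file name as the first column
        frames.append(last)
    # ignore_index=True renumbers the rows 0..n-1 in the combined frame
    return pd.concat(frames, ignore_index=True)
```

No intermediate CSV files are written, so the extra folder from the question is not needed. Note that pd.concat raises an error if the list of frames is empty.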
This is an easy case for bash and friends. This one-liner
for i in A*.csv B*.csv C*.csv D*.csv; do awk -F , 'END{ print $NF }' "$i"; done
extracts the bottom-right field, no matter how many rows or columns, of any number of files that follow the pattern you have given. If all files were in one folder, and they were the only .csv files in that folder, and you wanted to save the outcome in a new file, this would do the job:
for i in *.csv; do awk -F , 'END{ print $NF }' "$i"; done > extract.txt
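For comparison, roughly the same extraction can be sketched in Python with the standard csv module. The function name bottom_right_field is hypothetical. One difference from the awk version is an assumption of this sketch: blank lines are skipped, so a trailing empty line does not count as the last row.

```python
import csv


def bottom_right_field(path):
    """Return the last field of the last non-empty row as a string,
    or None if the file contains no rows.

    `bottom_right_field` is a hypothetical helper name used for illustration.
    """
    last_row = None
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if row:  # csv.reader yields [] for blank lines; skip those
                last_row = row
    return last_row[-1] if last_row else None
```

Unlike splitting on commas, csv.reader also handles quoted fields that contain commas.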