
Using rbind() to combine multiple data frames into one larger data.frame within lapply()

I'm using RStudio 0.99.491 and R version 3.2.3 (2015-12-10). I'm relatively new to R, and I'd appreciate some help. I'm working on a project that uses the server logs from an old media server to identify which folders and files on the server are still being accessed and which aren't, so that my team knows which files to migrate. Each log covers a 24-hour period, and I have approximately a year's worth of logs, so in theory I should be able to see all of the access over the past year.

My ideal output is a tree structure or plot that shows the folders on our server that are being used. I've figured out how to read one log (one day) into R as a data.frame and then use the data.tree package to turn that into a tree. Now I want to go through all of the files in the directory, one by one, and add them to that original data.frame before I create the tree. Here's my current code:

#Create the list of log files in the folder
files <- list.files(pattern = "*.log", full.names = TRUE, recursive = FALSE)
#Create a new data.frame to hold the aggregated log data
uridata <- data.frame()
#My function to go through each file, one by one, and add it to the 'uridata' df, above
lapply(files, function(x){
    uriraw <- read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
    #print(nrow(uriraw)
    uridata <- rbind(uridata, uriraw)
    #print(nrow(uridata))
})

The problem is that, no matter what I try, the value of 'uridata' inside the lapply loop doesn't seem to be saved or passed outside of the loop; it appears to be overwritten each time the loop runs. So instead of getting one big data.frame, I just get the contents of the last 'uriraw' file. (That's why there are two commented-out print commands inside the loop; I was checking how many rows the data frames had each time the loop ran.)

Can anyone clarify what I'm doing wrong? Again, I'd like one big data.frame at the end that combines the contents of each of the (currently seven) log files in the folder.

do.call() is your friend.

big.list.of.data.frames <- lapply(files, function(x){
    read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
})

Or, more concisely (but less tinkerable):

big.list.of.data.frames <- lapply(files, read.table, 
                                  skip = 3, header = TRUE,
                                  stringsAsFactors = FALSE)

Then:

big.data.frame <- do.call(rbind, big.list.of.data.frames)
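
To try the whole pattern end to end, here is a minimal self-contained sketch. The three temporary files and their columns (`uri`, `hits`) are made up for illustration; unlike the real logs they have no preamble lines, so `skip = 3` is left out.

```r
# Hypothetical stand-ins for the daily logs: three small files in a temp folder
log_dir <- tempfile("logs")
dir.create(log_dir)
for (i in 1:3) {
  day <- data.frame(uri = paste0("/media/file", i, c("a", "b")),
                    hits = c(i, i * 10),
                    stringsAsFactors = FALSE)
  write.table(day, file.path(log_dir, paste0("day", i, ".log")),
              row.names = FALSE, quote = FALSE)
}

# list.files() treats 'pattern' as a regular expression;
# "\\.log$" matches names that end in ".log"
files <- list.files(log_dir, pattern = "\\.log$", full.names = TRUE)

# Read every file into its own data frame, collected in a list
big.list.of.data.frames <- lapply(files, read.table,
                                  header = TRUE, stringsAsFactors = FALSE)

# Stack all of the data frames with a single rbind() call
big.data.frame <- do.call(rbind, big.list.of.data.frames)
```

Here `do.call(rbind, x)` is equivalent to writing `rbind(x[[1]], x[[2]], x[[3]])` by hand, so it only works when every data frame has the same column names.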

This is the recommended way to do things because "growing" a data frame dynamically in R is painful: it is slow and memory-expensive, since a new data frame gets built at each iteration.
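
As for why the loop in the question leaves `uridata` untouched: `<-` inside a function creates a variable local to that call, so the outer `uridata` is never modified. A small sketch of that behaviour, with made-up data frames standing in for the log files:

```r
# Two toy data frames standing in for the contents of two log files (made up)
daily <- list(data.frame(hits = 1:2), data.frame(hits = 3:5))

uridata <- data.frame()

# Same shape as the loop in the question: the assignment happens inside the function
result <- lapply(daily, function(uriraw) {
  uridata <- rbind(uridata, uriraw)  # binds a *local* uridata, discarded on return
})

nrow(uridata)         # the outer uridata is still empty
sapply(result, nrow)  # each list element holds only one day's rows

# Binding the list once, outside the loop, gives the combined data frame
combined <- do.call(rbind, daily)
nrow(combined)
```

The value of the assignment is still returned by the anonymous function, which is why `lapply()` hands back a list of per-file data frames that can then be combined in one step.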

Instead of lapply, you can use map_df from the purrr package, which combines all of the results into a data frame directly.

library(purrr)
map_df(files, read.table, skip = 3, header = TRUE, stringsAsFactors = FALSE)

Another option is fread from data.table:

library(data.table)
rbindlist(lapply(files, fread, skip=3))
