How do I run descriptive statistics in R for each file in a directory and append it to a data frame (txt file final output as a table)?

I have a folder of CSV files containing filtered (gated) data in two columns: dihedral angle and bend angle. Each file was filtered using its own minimum and maximum.

I need at least the mean, median, standard deviation, skewness, and kurtosis for each column of each file, presented as a table with one row per file.

I am not familiar with which R packages might suit this task, so I tried something simple. I can get it to work for a single file, but I have over 200 files. The files will likely be updated, so I will have to run this multiple times.

module load ccs/container/R/4.1.0
R

library(moments)

files <- list.files("/mnt/gpfs2_4m/scratch/username/fs_scripts/foldedstart_*", pattern="*.csv", recursive=TRUE, full.names=TRUE)

cat("filename","\t","dihedral mean","\t","bend mean","\t","dihedral median","\t","bend median","\t","dh sd","\t","bd sd","\t","dh skew","\t","bd skew","\t","dh kurt","\t","bd kurt","\n")

for (currentFile in files) {
  df <- read.table(fileName[i], header=TRUE)

  z1 <- mean(df$V1)
  z2 <- median(df$V1)
  z3 <- sd(df$V1)
  z4 <- skewness(df$V1)
  z5 <- kurtosis(df$V1)

  z7 <- mean(df$V2)
  z8 <- median(df$V2)
  z9 <- sd(df$V2)
  z10 <- skewness(df$V2)
  z11 <- kurtosis(df$V2)
  
  cat(filename,"\t",z1,"\t",z7,"\t",z2,"\t",z8,"\t",z3,"\t",z9,"\t",z4,"\t",z10,"\t",z5,"\t",z11,"\n")

  write.table(newdata, file=statsFileName[i]))
}

The first cat() line prints the header labels.

The cat() line inside the for loop likely goes nowhere, but it shows the format that I am trying to achieve.

The write.table() line is something that I found, but I don't think it is appropriate here.

I truly appreciate any help on this. I am not that familiar with R, and the examples that I have found are not close enough to what I am trying to do for me to adapt them.

The following computes all the statistics the question asks for, for each file, and writes a table of results to a CSV file.

library(moments)
#
stats <- function(filename, na.rm = TRUE) {
  x <- read.csv(filename)
  xbar <- colMeans(x, na.rm = na.rm)
  med <- apply(x, 2, median, na.rm = na.rm)
  S <- apply(x, 2, sd, na.rm = na.rm)
  skwn <- skewness(x, na.rm = na.rm)
  kurt <- kurtosis(x, na.rm = na.rm)
  #
  # return a data.frame, it will 
  # make the code simpler further on
  out <- data.frame(
    filename = filename, 
    dihedral.mean = xbar[1],
    bend.mean = xbar[2],
    dihedral.median = med[1],
    bend.median = med[2],
    dihedral.sd = S[1],
    bend.sd = S[2],
    dihedral.skewness = skwn[1],
    bend.skewness = skwn[2],
    dihedral.kurtosis = kurt[1],
    bend.kurtosis = kurt[2]
  )
  row.names(out) <- NULL
  out
}

statsFileName <- "statsfile.txt"

#files <- list.files("/mnt/gpfs2_4m/scratch/username/fs_scripts/foldedstart_*", pattern="*.csv", recursive=TRUE, full.names=TRUE)
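# full.names = TRUE returns paths that read.csv() can open from any working directory
files <- list.files("~/Temp", pattern = "^t.*\\.csv$", full.names = TRUE)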

newdata <- lapply(files, stats)
newdata <- do.call(rbind, newdata)

write.csv(newdata, file = statsFileName, row.names = FALSE)
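
One caveat about the commented-out list.files() call: list.files() treats its path as a literal directory and its pattern as a regular expression, so a shell wildcard such as foldedstart_* in the path is not expanded. Below is a minimal sketch of one possible workaround, assuming the matching directories can be expanded with Sys.glob() first. The helper name find_csv and the temporary directory layout are hypothetical and exist only for illustration.

```r
# Hypothetical helper: expand a shell-style wildcard in the directory part,
# then search each matching directory for .csv files using a regex.
find_csv <- function(dir_glob) {
  dirs <- Sys.glob(dir_glob)        # expands * in the path
  dirs <- dirs[dir.exists(dirs)]    # keep directories only
  list.files(dirs, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
}

# Illustration with a throwaway layout (not the asker's real paths)
root <- file.path(tempdir(), "glob_demo")
for (d in c("foldedstart_1", "foldedstart_2", "other")) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
  write.csv(data.frame(V1 = 1:3, V2 = 4:6),
            file.path(root, d, "angles.csv"), row.names = FALSE)
}

# Only files under the foldedstart_* directories should be returned
found <- find_csv(file.path(root, "foldedstart_*"))
print(basename(dirname(found)))
```

The resulting vector of full paths can be passed to lapply(files, stats) in the same way as above.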

This solution uses dplyr to summarise each file, combines the summaries into a single data frame, then writes the results to a CSV file.

library(moments)
library(dplyr)

### Create dummy csv files for reproducibility ###
if(!dir.exists("./data/")) dir.create("./data/")
for(i in 1:200){
    write.csv(data.frame(V1 = runif(100), V2 = runif(100)),
              file = paste0("./data/file_", i, ".csv"),
              row.names = FALSE)
}

### Summarise files ###
files <- list.files("./data", pattern = "\\.csv$", full.names = TRUE) # csv files only
all_results <- vector("list", length(files)) # results placeholder

# Loop that calculates summary statistics
for (i in seq_along(files)) { # safe even when no files are found
    currentFile <- files[i]
    df <- read.csv(file = currentFile, header=TRUE)
    result <- df %>% summarise_all(list(mean = mean, median = median, 
                              sd = sd, skew = skewness, kur = kurtosis))%>% 
        mutate(file = currentFile) %>% # add filename to the result
        select(file, everything()) # reorder 
    all_results[[i]] <- result
}

# Combine results into a single df
final_table <- bind_rows(all_results)

# write file
write.csv(final_table, "results.csv", row.names = FALSE)
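
The question asks for a txt file laid out as a table, so a tab-separated file may be closer to the requested output than a CSV. The following is a minimal sketch, assuming write.table() with sep = "\t" is acceptable. The data frame summary_table is a made-up stand-in for the combined results, and the output path is hypothetical.

```r
# Stand-in for the combined summary table (values are made up for illustration)
summary_table <- data.frame(
  file    = c("file_1.csv", "file_2.csv"),
  V1_mean = c(0.51, 0.48),
  V2_mean = c(0.47, 0.53)
)

out_file <- file.path(tempdir(), "results.txt")  # hypothetical output path

# Tab-separated, no quotes around strings, no row-name column
write.table(summary_table, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back to confirm that the table round-trips
check <- read.delim(out_file, stringsAsFactors = FALSE)
print(check)
```

Because the file is rewritten in full on each run, re-running the script after the input files change simply replaces the old table.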
