How to calculate the mean from multiple CSV files

In Python, the mean can be calculated from multiple CSV files with the following approach:

If file1.csv through file100.csv are all in the same directory, you can use this Python script:

#!/usr/bin/env python3

N = 100
mean_sum = 0
std_sum = 0
for i in range(1, N + 1):
    with open(f"file{i}.csv") as f:
        mean_sum += float(f.readline().split(",")[1])
        std_sum += float(f.readline().split(",")[1])

print(f"Mean of means: {mean_sum / N}")
print(f"Mean of stds: {std_sum / N}")

How can I do this in R?

"All can be coded", Erik :)

It is difficult to help if you do not provide a minimal reproducible example and describe what you have attempted so far and where things go wrong for you.

The following is based on {tidyverse}, a set of packages that work well together. I write in almost pseudo-code that should get you going. Obviously, you will have to adapt and rename things to fit your project, variable names, etc.

Good luck:

library(readr)     # package to read tabular data
library(dplyr)     # main workhorse to crunch data
library(purrr)     # functional programming for iterations/loops

pth <- "my-data-folder"    # provide path to your data

# create a list of file names in your folder
## you may need to fine-tune the regular expression to select the files you look for
## full.names gives you the path/name of your data files
## \\.csv$ "escapes" the dot and anchors the match at the end of the file name

fns <- list.files(path = pth, pattern = "file.*\\.csv$", full.names = TRUE)

# write a function that reads the file and calculates your stats
## you can "summarise" stats over a table

## replace my_target_variable with the name of the column you want to summarise
my_function <- function(.fn){
  df <- read_csv(.fn)     # read the file
  df %>% 
    summarise(MEAN = mean(my_target_variable)    # calc mean of your file/data
              , SD = sd(my_target_variable))     # calc sd of the data
}

# iterate with purrr::map := take list of filenames and apply your function to each list entry
## map_dfr() returns a data frame; use plain map() to get a list instead
## for testing purposes you can truncate the list of filenames, e.g. fns[1:3] for the
## first 3 files

ds <- fns %>% 
   purrr::map_dfr(.f = my_function)

ds

ds is a table with columns MEAN and SD, one row per file.
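For comparison with the Python script in the question, here is a rough standard-library sketch of the same idea: compute a mean and sd per file, then average those across files. It assumes each CSV has a header row and a numeric column; the function names and the column name you pass in are made up for illustration and are not part of the answer above.

```python
import csv
import statistics
from pathlib import Path


def file_stats(path, column):
    """Return (mean, sample sd) of one named column in a CSV with a header row."""
    with open(path, newline="") as f:
        values = [float(row[column]) for row in csv.DictReader(f)]
    # statistics.stdev is the sample standard deviation, like R's sd()
    return statistics.mean(values), statistics.stdev(values)


def summarise_folder(folder, column, pattern="*.csv"):
    """Compute per-file stats, then average them across all matching files."""
    stats = [file_stats(p, column) for p in sorted(Path(folder).glob(pattern))]
    if not stats:
        raise ValueError(f"no files matching {pattern!r} in {folder!r}")
    means, sds = zip(*stats)
    return {"mean_of_means": statistics.mean(means),
            "mean_of_sds": statistics.mean(sds)}
```

Calling summarise_folder("my-data-folder", "my_target_variable") would play the role of averaging the MEAN and SD columns of ds; both arguments are placeholders to adapt.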

It was kind of fun to think about making this example reproducible, so here's some code to create 100 CSVs, each with five columns of random data, read them back in, and do the calculation you want. As @Ray's answer suggests, using map() and its friends is a good way to tidily iterate.

library(readr)
library(dplyr)
library(tidyr)
library(purrr)

## Make a "tmpdat" folder in the working dir if one doesn't exist
ifelse(!dir.exists(file.path("tmpdat")), dir.create(file.path("tmpdat")), FALSE)

#> [1] TRUE

## Make 100 CSV files, each with 5 columns
## of random data.
set.seed(16)

nvars <- 5

paste0("csv_", 1:100) %>%
  set_names() %>%
  map(~ replicate(n = nvars, rnorm(100, 0, 1))) %>%
  map_dfr(as_tibble, .id = "id", .name_repair = ~ paste0("v", 1:nvars)) %>%
  group_by(id) %>%
  nest() %>%
  pwalk(~ write_csv(x = .y, file = paste0("tmpdat/", .x, ".csv")))

## Get their names
filenames <- dir(path = "tmpdat",
                 pattern = "\\.csv$",
                 full.names = TRUE)

## Read them in and then
## 1. Calculate the mean and sd of each column in each CSV
## 2. Get the overall mean of means and mean of sds for each column
filenames %>%
  map_dfr(read_csv, .id = "id", col_types = cols()) %>%
  group_by(id) %>%
  summarize(across(everything(),
                   list(mean = mean, sd = sd))) %>%
  pivot_longer(-id,
               names_to = c("col", ".value"), names_sep="_") %>%
  group_by(col) %>%
  summarize(avg_mean = mean(mean),
            avg_sd = mean(sd))


#> # A tibble: 5 x 3
#>   col   avg_mean avg_sd
#>   <chr>    <dbl>  <dbl>
#> 1 v1    -0.00433  1.01 
#> 2 v2     0.00124  0.989
#> 3 v3    -0.00185  0.997
#> 4 v4     0.00431  0.991
#> 5 v5    -0.00502  0.996

If you just want a single overall mean and overall sd (rather than one for each column across all the CSVs), this would be simpler: you could pivot the CSV variables into a single vector grouped by file id and take the mean and sd of that.
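As a rough illustration of that simpler variant, written in Python like the script in the question, the sketch below pools every value in a file into one vector, takes its mean and sd, and then averages those across files. It assumes each CSV has a header row and only numeric cells, as the generated files do; the function names are hypothetical.

```python
import csv
import statistics
from pathlib import Path


def pooled_stats(path):
    """Return (mean, sample sd) of all values in one CSV, with columns pooled."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row (assumed to be present)
        values = [float(cell) for row in reader for cell in row]
    return statistics.mean(values), statistics.stdev(values)


def overall_summary(folder):
    """Average the pooled per-file means and sds over every CSV in a folder."""
    per_file = [pooled_stats(p) for p in sorted(Path(folder).glob("*.csv"))]
    if not per_file:
        raise ValueError(f"no CSV files found in {folder!r}")
    means, sds = zip(*per_file)
    return statistics.mean(means), statistics.mean(sds)
```

With the files generated above, overall_summary("tmpdat") would return one (mean of means, mean of sds) pair instead of one row per column.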
