
How to calculate average of numbers from multiple csv files?

I have files like the following, which are replicates from a simulation experiment I've been running:

generation, ratio_of_player_A, ratio_of_player_B, ratio_of_player_C

So the data looks something like:

0, 0.33, 0.33, 0.33

1, 0.40, 0.40, 0.20

2, 0.50, 0.40, 0.10

etc

Since I run this multiple times, I have around 1000 files for each experiment, each giving different numbers. My problem is to average them all for one set of experiments.

Thus, I would like to have a file that contains the average ratio after each generation (averaged over multiple replicates, i.e. files).

All the replicate output files that need to be averaged have names like output1.csv, output2.csv, output3.csv ... output1000.csv.

I'd be obliged if someone could help me out with a shell script or a Python script.

If I understood correctly, let's say you have 2 files like these:

$ cat file1
0, 0.33, 0.33, 0.33
1, 0.40, 0.40, 0.20
2, 0.50, 0.40, 0.10

$ cat file2
0, 0.99, 1, 0.02
1, 0.10, 0.90, 0.90
2, 0.30, 0.10, 0.30

And you want to take the mean of each column across both files. So here is a way to do it for the first column:

Edit: I found a better way, using pd.concat:

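# file1 and file2 are the DataFrames loaded with pd.read_csv in the merge example below
all_files = pd.concat([file1,file2]) # you can easily put your 1000 files here
result = {}
for i in range(3): # 3 being number of generations
    result[i] = all_files[i::3].mean() # every 3rd row belongs to generation i
result_df = pd.DataFrame(result)
result_df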
                       0     1     2
ratio_of_player_A  0.660  0.25  0.40
ratio_of_player_B  0.665  0.65  0.25
ratio_of_player_C  0.175  0.55  0.20
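For the 1000-file case, the same concat idea can be sketched with a glob pattern and a groupby on the generation index, so the number of generations does not have to be hard-coded. This is only a sketch: the helper name `average_replicates` is made up for illustration, and it assumes the files have no header row and use a comma followed by a space as separator, as in the sample data.

```python
import glob

import pandas as pd

NAMES = ["generation", "ratio_of_player_A", "ratio_of_player_B", "ratio_of_player_C"]


def average_replicates(pattern, names=NAMES):
    """Average all CSV files matching `pattern`, generation by generation.

    Hypothetical helper; assumes header-less files shaped like the sample data.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # skipinitialspace drops the blank that follows each comma in the sample rows
    frames = [
        pd.read_csv(path, names=names, index_col=0, skipinitialspace=True)
        for path in paths
    ]
    # Stack all replicates, then average the rows that share a generation
    return pd.concat(frames).groupby(level=0).mean()


# Possible usage, with the file naming from the question:
# average_replicates("output*.csv").to_csv("average.csv")
```

Grouping by the index also keeps working if some replicate stops a few generations early; those generations are simply averaged over fewer files.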

Another way is with merge, but it requires multiple merges:

import pandas as pd

In [1]: names = ["generation", "ratio_of_player_A", "ratio_of_player_B", "ratio_of_player_C"]
In [2]: file1 = pd.read_csv("file1", index_col=0, names=names)
In [3]: file2 = pd.read_csv("file2", index_col=0, names=names)
In [3]: file1
Out[3]: 
            ratio_of_player_A  ratio_of_player_B  ratio_of_player_C
generation                                                         
0                        0.33               0.33               0.33
1                        0.40               0.40               0.20
2                        0.50               0.40               0.10

In [4]: file2
Out[4]: 
            ratio_of_player_A  ratio_of_player_B  ratio_of_player_C
generation                                                         
0                        0.99                1.0               0.02
1                        0.10                0.9               0.90
2                        0.30                0.1               0.30



In [5]: merged_file = file1.merge(file2, right_index=True, left_index=True, suffixes=["_1","_2"])
In [6]: merged_file.filter(regex="ratio_of_player_A_*").mean(axis=1)
Out[6]:
generation
0             0.66
1             0.25
2             0.40
dtype: float64

Or this way (a bit faster, I guess):

merged_file.iloc[:, ::3].mean(axis=1) # player A: every 3rd column (.ix was removed in pandas 1.0, .iloc replaces it)

You can merge recursively before applying the mean() method if you have more than two files.
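One way the recursive merge could look, as a rough sketch: `functools.reduce` folds the list of frames into one wide frame. The helper names `merge_all` and `mean_per_player` are invented for this example. Each frame's columns get a numeric suffix up front, because the `suffixes` argument of `merge` only handles a single pair of frames.

```python
import re
from functools import reduce

import pandas as pd


def merge_all(frames):
    """Merge frames on their index; frame k's columns get the suffix _k (hypothetical helper)."""
    tagged = [df.add_suffix(f"_{k}") for k, df in enumerate(frames, start=1)]
    # merge defaults to an inner join, so only generations present in every frame survive
    return reduce(
        lambda left, right: left.merge(right, left_index=True, right_index=True),
        tagged,
    )


def mean_per_player(merged, players):
    """Row-wise mean over all suffixed copies of each player column."""
    return pd.DataFrame(
        {
            player: merged.filter(regex=f"^{re.escape(player)}_\\d+$").mean(axis=1)
            for player in players
        }
    )


# Possible usage with the frames loaded above:
# merged = merge_all([file1, file2])
# mean_per_player(merged, file1.columns)
```

With 1000 files this builds a frame with 3000 columns, so the concat approach is probably the lighter option; the sketch is mainly there to show what "merge recursively" means.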

If I misunderstood the question, please show us the output you expect from file1 and file2.

Ask if there is something you don't understand.

Hope this helps!

The following should work:

from numpy import genfromtxt

files = ["file1", "file2", ...]

data = genfromtxt(files[0], delimiter=',')
for f in files[1:]:
    data += genfromtxt(f, delimiter=',')

data /= len(files)
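A variation on the same NumPy idea, in case typing out 1000 names is not appealing: collect the files with a glob pattern, stack them into one 3-D array, and take the mean over the file axis. The function name `average_csv_files` and the `skip_header` parameter are assumptions of this sketch (use `skip_header=1` if each file starts with the column-name line). All files need the same number of rows and columns.

```python
import glob

import numpy as np


def average_csv_files(pattern, skip_header=0):
    """Element-wise mean over all CSV files matching `pattern` (hypothetical helper)."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # One 2-D array per file, stacked to shape (n_files, n_rows, n_columns)
    stacked = np.stack(
        [np.genfromtxt(path, delimiter=",", skip_header=skip_header) for path in paths]
    )
    # The generation column is identical in every file, so its mean is unchanged
    return stacked.mean(axis=0)


# Possible usage, with the file naming from the question:
# np.savetxt("average.csv", average_csv_files("output*.csv"), delimiter=", ", fmt="%g")
```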

You can load each of the 1000 experiments into a dataframe, sum them all, then calculate the mean.

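import tkinter.filedialog
import pandas as pd

filepath = tkinter.filedialog.askopenfilenames(filetypes=[('CSV','*.csv')]) #select your files
dfs = [] #collects one dataframe per selected file
for file in filepath:
    #sep=';' and decimal=',' fit semicolon-separated files with decimal commas; adjust to your format
    df = pd.read_csv(file, sep=';', decimal=',')
    dfs.append(df)

temp = dfs[0] #creates a temporary variable to store the df
for i in range(1,len(dfs)): #starts from 1 cause 0 is stored in temp
    temp = temp + dfs[i]
result = temp/len(dfs)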

Your problem is not very clear, but if I understand it right:

>temp
for i in *.csv
do
    cat "$i" >> temp
done

Then you have all the data from the different files in one big file. Try loading it into an SQLite database (1. create a table, 2. insert the data). After that you can query your data, e.g. select sum(columns)/count(columns) from yourtablehavingtempdata. Since your data is tabular, SQLite would be better suited in my opinion.
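A rough sketch of the SQLite route, using Python's built-in `sqlite3` module instead of the command-line tool. The table name `ratios`, its column names, and the helper `average_with_sqlite` are all made up for this example, and the files are assumed to have no header row. `AVG(x)` is SQLite's shorthand for `sum(x)/count(x)`, and `GROUP BY generation` gives one averaged row per generation.

```python
import csv
import glob
import sqlite3


def average_with_sqlite(pattern):
    """Load all matching CSV files into an in-memory table and average per generation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ratios (generation INTEGER, a REAL, b REAL, c REAL)")
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as handle:
            # int() and float() tolerate the blank after each comma; empty lines are skipped
            rows = [
                [int(row[0])] + [float(value) for value in row[1:4]]
                for row in csv.reader(handle)
                if row
            ]
        conn.executemany("INSERT INTO ratios VALUES (?, ?, ?, ?)", rows)
    query = (
        "SELECT generation, AVG(a), AVG(b), AVG(c) "
        "FROM ratios GROUP BY generation ORDER BY generation"
    )
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Possible usage, with the file naming from the question:
# for generation, a, b, c in average_with_sqlite("output*.csv"):
#     print(generation, a, b, c)
```

Reading the files one by one like this also makes the concatenated `temp` file unnecessary.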

