How to calculate the average of several .dat files using Python?

So I have 50-60 .dat files, all containing m rows and n columns of numbers. I need to take the average of all of the files and create a new file in the same format. I have to do this in Python. Can anyone help me with this?

I've written some code. I realize I have some incompatible types here, but I can't think of an alternative, so I haven't changed anything yet.

#! /usr/bin/python
import os

CC = 1.96

average = []
total = []
count = 0
os.chdir("./")
for files in os.listdir("."):
    if files.endswith(".dat"):
        infile = open(files)
        cur = []
        cur = infile.readlines()
        for i in xrange(0, len(cur)):
            cur[i] = cur[i].split()   # each row becomes a list of strings, not numbers
        total += cur                  # extends the list with rows; no element-wise sum
        count += 1
average = [x/count for x in total]    # fails: each x is a list of strings

#calculate uncertainty
uncert = []

for files in os.listdir("."):
    if files.endswith(".dat"):
        infile = open(files)
        cur = []
        cur = infile.readlines        # missing (): binds the method instead of calling it
        for i in xrange(0, len(cur)):
            cur[i] = cur[i].split()
        uncert += (cur - average)**2  # plain lists don't support - or **
uncert = uncert**.5
uncert = uncert*CC
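For reference, here is a minimal corrected sketch of what this code appears to be attempting: an element-wise mean across files (assuming every .dat file has the same number of rows and columns, and Python 2 as above; the uncertainty part would follow the same pattern):

#! /usr/bin/python
import os

total = None  # running element-wise sums
count = 0
for name in os.listdir("."):
    if name.endswith(".dat"):
        with open(name) as infile:
            rows = [[float(word) for word in line.split()] for line in infile]
        if total is None:
            total = rows  # the first file initializes the sums
        else:
            for i in xrange(len(rows)):
                for j in xrange(len(rows[i])):
                    total[i][j] += rows[i][j]
        count += 1

average = [[x / count for x in row] for row in total]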

Here's a fairly time- and resource-efficient approach which reads in the values and calculates their averages for all the files in parallel, yet only reads in one line per file at a time. However, it does temporarily read the entire first .dat file into memory in order to determine how many rows and columns of numbers are going to be in each file.

You didn't say whether your "numbers" are integers or floats, so this reads them in as floating point (which will work even if they are all integers). Regardless, the averages are calculated and output as floating-point numbers.

Update

I've modified my original answer to also calculate a sample standard deviation of the values at each row and column position, as per your comment. It does this right after it computes their mean value, so a second pass to re-read all the data isn't necessary. In addition, in response to a suggestion made in the comments, a context manager has been added to ensure that all the input files get closed.

Note that the standard deviations are only printed and are not written to the output file, but doing that to the same or a separate file should be easy enough to add.

from contextlib import contextmanager
from itertools import izip
from glob import iglob
from math import sqrt
from sys import exit

@contextmanager
def multi_file_manager(files, mode='rt'):
    """ Open multiple files and make sure they all get closed. """
    files = [open(file, mode) for file in files]
    try:
        yield files
    finally:
        for file in files:
            file.close()

# generator function to read, convert, and yield each value from a text file
def read_values(file, datatype=float):
    for line in file:
        for value in (datatype(word) for word in line.split()):
            yield value

# enumerate multiple equal-length iterables simultaneously as (i, n0, n1, ...)
def multi_enumerate(*iterables, **kwds):
    start = kwds.get('start', 0)
    return ((n,)+t for n, t in enumerate(izip(*iterables), start))

DATA_FILE_PATTERN = 'data*.dat'
MIN_DATA_FILES = 2

with multi_file_manager(iglob(DATA_FILE_PATTERN)) as datfiles:
    num_files = len(datfiles)
    if num_files < MIN_DATA_FILES:
        print('Less than {} .dat files were found to process, '
              'terminating.'.format(MIN_DATA_FILES))
        exit(1)

    # determine number of rows and cols from first file
    temp = [line.split() for line in datfiles[0]]
    num_rows = len(temp)
    num_cols = len(temp[0])
    datfiles[0].seek(0)  # rewind first file
    del temp  # no longer needed
    print '{} .dat files found, each must have {} rows x {} cols\n'.format(
           num_files, num_rows, num_cols)

    means = []
    std_devs = []
    divisor = float(num_files-1)  # Bessel's correction for sample standard dev
    generators = [read_values(file) for file in datfiles]
    for _ in xrange(num_rows):  # main processing loop
        for _ in xrange(num_cols):
            # create a sequence of next cell values from each file
            values = tuple(next(g) for g in generators)
            mean = float(sum(values)) / num_files
            means.append(mean)
            means_diff_sq = ((value-mean)**2 for value in values)
            std_dev = sqrt(sum(means_diff_sq) / divisor)
            std_devs.append(std_dev)

print 'Average and (standard deviation) of values:'
with open('means.txt', 'wt') as averages:
    for i, mean, std_dev in multi_enumerate(means, std_devs):
        print '{:.2f} ({:.2f})'.format(mean, std_dev),
        averages.write('{:.2f}'.format(mean))  # note std dev not written
        if i % num_cols != num_cols-1:  # not last column?
            averages.write(' ')  # delimiter between values on line
        else:
            print  # newline
            averages.write('\n')
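As a quick sanity check (hypothetical data), suppose there are two files, data1.dat containing

1 2
3 4

and data2.dat containing

3 6
5 8

Then the script would print the following and write just the means to means.txt:

2 .dat files found, each must have 2 rows x 2 cols

Average and (standard deviation) of values:
2.00 (1.41) 4.00 (2.83)
4.00 (1.41) 6.00 (2.83)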

I am not sure which aspect of the process is giving you the problem, but I will just answer specifically about getting the averages of all the dat files.

Assuming a data structure like this:

72 12 94 79 76  5 30 98 97 48 
79 95 63 74 70 18 92 20 32 50 
77 88 60 98 19 17 14 66 80 24 
...

Getting averages of the files:

import glob
import itertools

avgs = []

for datpath in glob.iglob("*.dat"):
    with open(datpath, 'r') as f:
        str_nums = itertools.chain.from_iterable(i.strip().split() for i in f)
        nums = map(int, str_nums)
        avg = sum(nums) / len(nums)
        avgs.append(avg)

print avgs

It loops over each .dat file, reads the lines, and joins them into one stream of values. It converts them to int (they could be float if you want) and appends the average for that file.
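One caveat, assuming Python 2 as in the rest of this page: with int values, sum(nums) / len(nums) performs integer division and silently truncates. Forcing float division is a one-line change, e.g.:

nums = [72, 12, 94]                 # e.g. the values parsed from one file
avg = float(sum(nums)) / len(nums)  # 59.33..., rather than the truncated 59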

If these files are enormous and you are concerned about the amount of memory used when reading them in, you could more explicitly loop over each line and only keep running counts and totals, the way your original example was doing:

for datpath in glob.iglob("*.dat"):
    with open(datpath, 'r') as f:
        count = 0
        total = 0
        for line in f:
            nums = [int(i) for i in line.strip().split()]
            count += len(nums)
            total += sum(nums)
        avgs.append(total / count)
  • Note: I am not handling exceptional cases, such as the file being empty, which would produce a divide-by-zero error (a guarded variant is sketched below).
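If you want to guard against that, here is a minimal variant of the loop above (my sketch, which simply skips files that contain no numbers):

import glob

avgs = []
for datpath in glob.iglob("*.dat"):
    with open(datpath, 'r') as f:
        count = 0
        total = 0
        for line in f:
            nums = [int(i) for i in line.strip().split()]
            count += len(nums)
            total += sum(nums)
        if count:  # skip empty files instead of dividing by zero
            avgs.append(float(total) / count)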
