
How to calculate the average of several .dat files using Python?

So I have 50-60 .dat files, all containing m rows and n columns of numbers. I need to take the average of all of the files and create a new file in the same format. I have to do this in Python. Can anyone help me with this?

I've written some code. I realize I have some incompatible types here, but I can't think of an alternative, so I haven't changed anything yet.

#! /usr/bin/python
import os

CC = 1.96

average = []
total = []
count = 0
os.chdir("./")
for files in os.listdir("."):
    if files.endswith(".dat"):
        infile = open(files)
        cur = []
        cur = infile.readlines()
        for i in xrange(0, len(cur)):
            cur[i] = cur[i].split()  # each row becomes a list of strings
        total += cur   # appends rows to total; doesn't sum them element-wise
        count += 1
average = [x/count for x in total]  # fails: each x is a list of strings, not a number

#calculate uncertainty
uncert = []

for files in os.listdir("."):
    if files.endswith(".dat"):
        infile = open(files)
        cur = []
        cur = infile.readlines()
        for i in xrange(0, len(cur)):
            cur[i] = cur[i].split()
        uncert += (cur - average)**2  # fails: lists don't support - or **
uncert = uncert**.5   # fails: lists don't support **
uncert = uncert*CC    # fails: can't multiply a list by a float

Here's a fairly time- and resource-efficient approach that reads in the values and calculates their averages for all the files in parallel, yet only reads one line per file at a time. However, it does temporarily read the entire first .dat file into memory in order to determine how many rows and columns of numbers each file should have.

You didn't say whether your "numbers" are integers, floats, or something else, so this reads them in as floating point (which will work even if they're actually integers). Regardless, the averages are calculated and output as floating-point numbers.

Update

I've modified my original answer to also calculate a sample standard deviation ( sigma ) of the values in each row and column, as per your comment. It does this right after it computes their mean value, so a second pass to re-read all the data isn't necessary. In addition, in response to a suggestion made in the comments, a context manager has been added to ensure that all the input files get closed.

Note that the standard deviations are only printed and are not written to the output file, but writing them to the same or a separate file should be easy enough to add (see the sketch after the code).

from contextlib import contextmanager
from itertools import izip
from glob import iglob
from math import sqrt
from sys import exit

@contextmanager
def multi_file_manager(files, mode='rt'):
    """ Open multiple files and make sure they all get closed, even on error. """
    files = [open(file, mode) for file in files]
    try:
        yield files
    finally:
        for file in files:
            file.close()

# generator function to read, convert, and yield each value from a text file
def read_values(file, datatype=float):
    for line in file:
        for value in (datatype(word) for word in line.split()):
            yield value

# enumerate multiple equal-length iterables simultaneously as (i, n0, n1, ...)
def multi_enumerate(*iterables, **kwds):
    start = kwds.get('start', 0)
    return ((n,)+t for n, t in enumerate(izip(*iterables), start))

DATA_FILE_PATTERN = 'data*.dat'
MIN_DATA_FILES = 2

with multi_file_manager(iglob(DATA_FILE_PATTERN)) as datfiles:
    num_files = len(datfiles)
    if num_files < MIN_DATA_FILES:
        print('Less than {} .dat files were found to process, '
              'terminating.'.format(MIN_DATA_FILES))
        exit(1)

    # determine number of rows and cols from first file
    temp = [line.split() for line in datfiles[0]]
    num_rows = len(temp)
    num_cols = len(temp[0])
    datfiles[0].seek(0)  # rewind first file
    del temp  # no longer needed
    print '{} .dat files found, each must have {} rows x {} cols\n'.format(
           num_files, num_rows, num_cols)

    means = []
    std_devs = []
    divisor = float(num_files-1)  # Bessel's correction for sample standard dev
    generators = [read_values(file) for file in datfiles]
    for _ in xrange(num_rows):  # main processing loop
        for _ in xrange(num_cols):
            # create a sequence of next cell values from each file
            values = tuple(next(g) for g in generators)
            mean = float(sum(values)) / num_files
            means.append(mean)
            means_diff_sq = ((value-mean)**2 for value in values)
            std_dev = sqrt(sum(means_diff_sq) / divisor)
            std_devs.append(std_dev)

print 'Average and (standard deviation) of values:'
with open('means.txt', 'wt') as averages:
    for i, mean, std_dev in multi_enumerate(means, std_devs):
        print '{:.2f} ({:.2f})'.format(mean, std_dev),
        averages.write('{:.2f}'.format(mean))  # note std dev not written
        if i % num_cols != num_cols-1:  # not last column?
            averages.write(' ')  # delimiter between values on line
        else:
            print  # newline
            averages.write('\n')
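
For example, a minimal sketch of that addition (my code, not part of the answer above; means_with_sigma.txt is a hypothetical file name, and it reuses the means, std_devs, num_cols, and multi_enumerate defined earlier):

# sketch: write "mean (std_dev)" pairs, mirroring the console output above
with open('means_with_sigma.txt', 'wt') as outfile:
    for i, mean, std_dev in multi_enumerate(means, std_devs):
        outfile.write('{:.2f} ({:.2f})'.format(mean, std_dev))
        # space between values within a row, newline after the last column
        outfile.write(' ' if i % num_cols != num_cols-1 else '\n')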

I am not sure which aspect of the process is giving you the problem, so I will just answer specifically about getting the averages of all the .dat files.

Assuming a data structure like this:

72 12 94 79 76  5 30 98 97 48 
79 95 63 74 70 18 92 20 32 50 
77 88 60 98 19 17 14 66 80 24 
...

Getting averages of the files:

import glob
import itertools

avgs = []

for datpath in glob.iglob("*.dat"):
    with open(datpath, 'r') as f:
        str_nums = itertools.chain.from_iterable(i.strip().split() for i in f)
        nums = map(int, str_nums)
        avg = sum(nums) / float(len(nums))  # float division so the average isn't truncated
        avgs.append(avg)

print avgs

It loops over each .dat file, reads the lines, and chains them together into one stream of numbers. It converts them to int (float also works, as sketched below) and appends each file's average to avgs.
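
If the values may not be integers, a float-parsing variant of the same loop (a sketch under that assumption, using the file layout shown above) looks like this:

import glob
import itertools

avgs = []

for datpath in glob.iglob("*.dat"):
    with open(datpath, 'r') as f:
        # same flattening as above, but parse floats so decimal values work
        str_nums = itertools.chain.from_iterable(i.strip().split() for i in f)
        nums = map(float, str_nums)
        avgs.append(sum(nums) / len(nums))  # floats, so no truncating division

print avgs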

If these files are enormous and you are concerned about memory use when reading them in, you could more explicitly loop over each line and only keep running counters, the way your original example was doing:

for datpath in glob.iglob("*.dat"):
    with open(datpath, 'r') as f:
        count = 0
        total = 0
        for line in f:
            nums = [int(i) for i in line.strip().split()]
            count += len(nums)
            total += sum(nums)
        avgs.append(total / float(count))  # float division here as well
  • Note: I am not handling exceptional cases, such as a file being empty, which would produce a divide-by-zero error (a minimal guard is sketched below).
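
For illustration, a minimal guard (my addition, not part of the original answer) would replace the final avgs.append(...) line of the loop above:

        # guard: skip files that contained no numbers, avoiding ZeroDivisionError
        if count:
            avgs.append(total / float(count))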
