Python - Calculate average for every column in a csv file

Question

I'm new to Python and I'm trying to get the average of every column (or row) of a csv file, and then select the values that are higher than double the average of their column (or row). My file has hundreds of columns and contains float values like these:

845.123,452.234,653.23,...
432.123,213.452,421.532,...
743.234,532,432.423,...

I've tried several changes to my code to get the average for every column (separately), but at the moment my code looks like this:

def AverageColumn (c):
    f=open(csv,"r")
    average=0
    Sum=0
    column=len(f)
    for i in range(0,column):
        for n in i.split(','):
            n=float(n)
            Sum += n
        average = Sum / len(column)
    return 'The average is:', average

    f.close()


csv="MDT25.csv"
print AverageColumn(csv)

But I always get an error like "f has no len()" or "'int' object is not iterable"...

I'd really appreciate it if someone could show me how to get the average for every column (or row, as you prefer), and then select the values that are higher than double the average of their column (or row). I'd rather do it without importing modules such as csv, but whatever you prefer. Thanks!

Answer 1

Here's a clean up of your function, but it probably doesn't do what you want it to do. Currently, it is getting the average of all values in all columns:

def average_column(csv):
    f = open(csv, "r")
    total = 0.0
    value_count = 0
    for row in f:
        for column in row.split(','):
            total += float(column)
            value_count += 1
    f.close()
    # one average over every value in the file, not one per column
    average = total / value_count
    return 'The average is:', average

I would use the csv module (which makes csv parsing easier), with a Counter object to manage the column totals and a context manager to open the file (no need for a close() ):

import csv
from collections import Counter

def average_column(csv_filepath):
    column_totals = Counter()
    row_count = 0
    with open(csv_filepath, newline="") as f:
        reader = csv.reader(f)
        for row in reader:
            for column_idx, column_value in enumerate(row):
                try:
                    column_totals[column_idx] += float(column_value)
                except ValueError:
                    print("Error -- ({}) Column({}) could not be converted to float!".format(column_value, column_idx))
            row_count += 1

    # row_count counts every row read, including rows with values that failed to convert

    # make sure column index keys are in order
    column_indexes = sorted(column_totals)

    # calculate per column averages using a list comprehension
    averages = [column_totals[idx] / row_count for idx in column_indexes]
    return averages

Answer 2

First of all, as people say, the CSV format looks simple, but it can be quite nontrivial, especially once strings come into play. monkut already gave you two solutions: the cleaned-up version of your code, and one more that uses the csv library. I'll give yet another option: no libraries, but plenty of idiomatic code to chew on, which gives you averages for all columns at once.

def get_averages(csv_path):
    with open(csv_path) as infile:
        # skip blank lines, which float() cannot parse
        lines = [line for line in infile if line.strip()]
    rows_of_numbers = [list(map(float, line.split(','))) for line in lines]
    sums = map(sum, zip(*rows_of_numbers))
    averages = [sum_item / len(lines) for sum_item in sums]
    return averages

Things to note: In your code, f is a file object. You try to close it after you have already returned the value. That code will never be reached: nothing executes after a return has been processed, unless you have a try...finally construct or a with construct (like the one I am using, which closes the stream automatically).

map(f, l), or the equivalent [f(x) for x in l], creates a new list whose elements are obtained by applying the function f to each element of l. (In Python 3, map returns a lazy iterator instead of a list; wrap it in list(...) if you need a list.)

f(*l) will "unpack" the list l before the function is invoked, passing each element to f as a separate argument.
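To make the last two notes concrete, here is a small sketch on made-up in-memory data. The `rows` values are illustrative only and are not taken from the question's file.

```python
# Hypothetical sample data standing in for already-parsed CSV lines.
rows = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]

# zip(*rows) is the same call as zip(rows[0], rows[1]):
# unpacking hands each row to zip as a separate argument,
# so the values come back regrouped by column.
columns = list(zip(*rows))

# Apply sum to every column; list(...) forces the result into a list.
column_sums = list(map(sum, columns))
column_averages = [total / len(rows) for total in column_sums]
```

Here `columns` holds one tuple per column, which is why `sum` can then be mapped over it directly.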

Answer 3

If you want to do it without stdlib modules for some reason:

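with open('path/to/csv') as infile:
    # the first row seeds the running averages
    columns = list(map(float, next(infile).split(',')))
    for how_many_entries, line in enumerate(infile, start=2):
        for (idx, running_avg), new_data in zip(enumerate(columns), line.split(',')):
            # incremental mean update: avg += (x - avg) / n
            columns[idx] += (float(new_data) - running_avg) / how_many_entries
# columns now holds the average of each column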
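The loop above appears to keep a running mean per column rather than a running sum: when the n-th row arrives, each entry is nudged by (x - mean) / n, so no separate row count or final division is needed. A minimal sketch of the same update on in-memory rows follows; the function name `running_column_means` and the sample data are assumptions made for illustration.

```python
def running_column_means(rows):
    """Return per-column means using the update mean += (x - mean) / n."""
    rows = iter(rows)
    # The first row seeds the means (n = 1).
    means = [float(value) for value in next(rows)]
    for n, row in enumerate(rows, start=2):
        for idx, value in enumerate(row):
            means[idx] += (float(value) - means[idx]) / n
    return means


# Hypothetical sample: two columns, three rows, values still as strings.
sample = [["1", "10"], ["2", "20"], ["3", "30"]]
result = running_column_means(sample)
```

Because only one list of means is kept, this approach never holds more than one row in memory at a time.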

Answer 4

I suggest breaking this into several smaller steps:

Read the CSV file into a 2D list or 2D array.
Calculate the averages of each column.

These steps can be implemented as two separate functions. (In a realistic situation where the CSV file is large, reading the complete file into memory might be prohibitive due to space constraints. However, for a learning exercise, this is a great way to gain an understanding of writing your own functions.)
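A minimal sketch of those two steps, plus a third helper for the selection the question asks about (values greater than twice their column's average). The function names are hypothetical, and the sketch assumes a purely numeric, comma-separated file with no header and the same number of values on every row.

```python
def read_rows(path):
    """Step 1: read a numeric CSV file into a list of rows (lists of floats)."""
    with open(path) as infile:
        return [[float(cell) for cell in line.split(",")]
                for line in infile if line.strip()]  # skip blank lines


def column_averages(rows):
    """Step 2: average each column of a rectangular 2D list."""
    return [sum(column) / len(column) for column in zip(*rows)]


def values_above_double_average(rows):
    """Return (row_index, column_index, value) for every value that is
    greater than twice the average of its column."""
    averages = column_averages(rows)
    return [(r, c, value)
            for r, row in enumerate(rows)
            for c, value in enumerate(row)
            if value > 2 * averages[c]]
```

For the question's file this might be called as values_above_double_average(read_rows("MDT25.csv")). Holding every row in memory keeps each function short, at the cost of the space concern mentioned above.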

Answer 5

I hope this helps. Here is what I would do, which is to use numpy:

import csv
import numpy as np

# Assumes a header row, with question 1 in column index 1 and
# question 2 in column index 2 (column index 0 is not used here).
with open('filename.csv', 'r', newline='') as f:
    data = list(csv.reader(f))

header = data.pop(0)  # remove the header row before converting values

# float() rather than int(), so decimal values are accepted too
q1 = [float(row[1]) for row in data]
q2 = [float(row[2]) for row in data]

# ========================================
# === Means/Variance - Work-up Section ===
# ========================================
print('Mean - Question-1:            ', np.mean(q1))
print('Variance,Question-1:          ', np.var(q1))
print('==============================================')
print('Mean - Question-2:            ', np.mean(q2))
print('Variance,Question-2:          ', np.var(q2))

Answer 6

This definitely worked for me!

import numpy as np
import csv

your_column_number = 0  # placeholder: index of the column to average

with open('C:\\...\\your_file_name.csv', 'r', newline='') as f:
    data = list(csv.reader(f))

# in case you have a header/title in the first row of your csv file, do the next line, else skip it
data.pop(0)

# float() rather than int(), so values such as 845.123 are accepted
q1 = [float(row[your_column_number]) for row in data]

print('Mean of your_column_number :            ', np.mean(q1))

Answer 7

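import csv
from statistics import mean
with open(r'path/to/csv', 'r', newline='') as f:
    reader = csv.reader(f)
    # str.isnumeric() is False for "845.123", so drop one "." before testing;
    # this check also skips header text, negative numbers and exponents
    print(mean(float(row[2]) for row in reader if row[2].replace('.', '', 1).isdigit()))

Replace '2' with the index of the column you wish to average.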

7 answers

solution1
4 ACCEPTED 2014-09-01 00:42:37

solution2
2 2014-09-01 00:50:40

solution3
0 2014-09-01 00:47:40

solution4
0 2014-09-01 01:11:34

solution5
0 2018-04-07 18:02:32

solution6
0 2019-02-27 15:36:50

solution7
0 2022-08-25 08:01:57
