How to get averages per column not row from a CSV file?

Question

I have a CSV file with 13 columns and 303 rows (lines). I have divided the 303 rows into healthy patients and ill patients, and I am now trying to get the average of each column for the healthy patients and for the ill patients so I can compare and contrast them. The expected output looks like the example below. The CSV file contains numbers similar to the averages in this example, except that missing data is marked with ?'s.

Please enter a training file name: train.csv
Total Lines Processed: 303
Total Healthy Count: 164
Total Ill Count: 139
Averages of Healthy Patients:
[52.59, 0.56, 2.79, 129.25, 242.64, 0.14, 0.84, 158.38, 0.14, 0.59, 1.41, 0.27, 3.77, 0.00]
Averages of Ill Patients:
[56.63, 0.82, 3.59, 134.57, 251.47, 0.16, 1.17, 139.26, 0.55, 1.57, 1.83, 1.13, 5.80, 2.04]
Seperation Values are:
[54.61, 0.69, 3.19, 131.91, 247.06, 0.15, 1.00, 148.82, 0.34, 1.08, 1.62, 0.70, 4.79, 1.02]

I still have a long way to go on my code. I'm just looking for a simple way to get the averages for the patients. My current method only gets the average of column 13, but I need all of the columns, as shown above. Any help on which way I should go about solving this would be appreciated.

import csv
#turn csv files into a list of lists
with open('train.csv') as csvfile:
     reader = csv.reader(csvfile, delimiter=',')
     csv_data = list(reader)

i_list = []
for row in csv_data:
    if (row and int(row[13]) > 0):
        i_list.append(int(row[13]))
H_list = []
for row in csv_data:
    if (row and int(row[13]) <= 0):
        H_list.append(int(row[13]))

Icount = len(i_list)
IPavg = sum(i_list)/len(i_list)
Hcount = len(H_list)
HPavg = sum(H_list)/len(H_list)
file = open("train.csv")
numline = len(file.readlines())

print(numline)
print("Total amount of healthy patients " + str(Hcount))
print("Total amount of ill patients " + str(Icount))
print("Averages of healthy patients " + str(HPavg))
print("Averages of ill patients " + str(IPavg))

My only idea would be to do the same as I did to get the average for column 13, but I don't know how I would keep the healthy patients separated from the ill patients.

Answer 1

If you want averages for each column, then it would be easiest to process all of them at once as you read the file; it's not that difficult. You didn't specify which version of Python you're using, but the following should work in both Python 2 and 3 (although it could be optimized for one or the other).

import csv

NUMCOLS = 14   # 13 attribute columns plus the diagnosis column (row[13])
DIAG_COL = 13  # index of the diagnosis column: 0 = healthy, > 0 = ill

with open('train.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    # initialize totals
    Icount = 0
    Hcount = 0
    H_col_totals = [0.0 for _ in range(NUMCOLS)]  # init to floating pt value for Py 2
    I_col_totals = [0.0 for _ in range(NUMCOLS)]  # init to floating pt value for Py 2
    # read and process file
    for row in reader:
        if row:  # non-blank line?
            # convert to numbers; assumption: a missing value ('?') counts as 0.0
            row = [0.0 if value.strip() == '?' else float(value) for value in row]
            # classify the whole row once, based on the diagnosis column
            if row[DIAG_COL] > 0:
                Icount += 1
                totals = I_col_totals
            else:
                Hcount += 1
                totals = H_col_totals
            # update running total for each column
            for col in range(NUMCOLS):
                totals[col] += row[col]

# compute average of data in each column
if Hcount < 1:  # avoid dividing by zero
    HPavgs = [0.0 for _ in range(NUMCOLS)]
else:
    HPavgs = [H_col_totals[col]/Hcount for col in range(NUMCOLS)]

if Icount < 1:  # avoid dividing by zero
    IPavgs = [0.0 for _ in range(NUMCOLS)]
else:
    IPavgs = [I_col_totals[col]/Icount for col in range(NUMCOLS)]

print("Total number of healthy patients: {}".format(Hcount))
print("Total number of ill patients: {}".format(Icount))
print("Averages of healthy patients: " +
      ", ".join(format(HPavgs[col], ".2f") for col in range(NUMCOLS)))
print("Averages of ill patients: " +
      ", ".join(format(IPavgs[col], ".2f") for col in range(NUMCOLS)))
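
If the ?'s should be left out of the averages rather than counted, one option is to keep a separate count for every column. Below is a minimal sketch under that assumption. The names column_averages, separation_values and MISSING are made up for illustration, and the midpoint formula for the separation values is only inferred from the sample output in the question (for example, (52.59 + 56.63) / 2 = 54.61), so check it against your assignment.

```python
MISSING = "?"  # marker for missing data, as described in the question


def column_averages(rows, diag_col=13):
    """Return (healthy_averages, ill_averages) for rows of strings.

    A row counts as ill when its diagnosis column is > 0. Missing
    entries are skipped, so each column is divided by its own count.
    """
    ncols = diag_col + 1
    # index 0 holds the healthy sums/counts, index 1 holds the ill ones
    sums = [[0.0] * ncols, [0.0] * ncols]
    counts = [[0] * ncols, [0] * ncols]
    for row in rows:
        if not row or row[diag_col].strip() == MISSING:
            continue  # skip blank lines and rows without a diagnosis
        group = 1 if float(row[diag_col]) > 0 else 0
        for col in range(ncols):
            value = row[col].strip()
            if value != MISSING:
                sums[group][col] += float(value)
                counts[group][col] += 1
    # a column with no usable values gets 0.0 instead of dividing by zero
    averages = [
        [s / c if c else 0.0 for s, c in zip(sums[g], counts[g])]
        for g in (0, 1)
    ]
    return averages[0], averages[1]


def separation_values(healthy, ill):
    """Midpoint between the healthy and ill average of each column."""
    return [(h + i) / 2 for h, i in zip(healthy, ill)]
```

To use it on the file, pass the csv reader straight in: open 'train.csv', call column_averages(csv.reader(csvfile)), then feed both results to separation_values.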

Answer 2

Why don't you use the pandas module?

It would be a lot easier to accomplish what you want.

In [42]: import pandas as pd

In [43]: import numpy as np

In [44]: df = pd.DataFrame(np.random.randn(10, 4))

In [45]: df
Out[45]:
          0         1         2         3
0  1.290657 -0.376132 -0.482188  1.117486
1 -0.620332 -0.247143  0.214548 -0.975472
2  1.803212 -0.073028  0.224965  0.069488
3 -0.249340  0.491075  0.083451  0.282813
4 -0.477317  0.059482  0.867047 -0.656830
5  0.117523  0.089099 -0.561758  0.459426
6 -0.173780 -0.066054 -0.943881 -0.301504
7  1.250235 -0.949350 -1.119425  1.054016
8  1.031764 -1.470245 -0.976696  0.579424
9  0.300025  1.141415  1.503518  1.418005

In [46]: df.mean()
Out[46]:
0    0.427265
1   -0.140088
2   -0.119042
3    0.304685
dtype: float64

In your case you can try:

In [47]: df = pd.read_csv('yourfile.csv')
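
To keep the healthy and ill patients apart, you could then group the rows by whether the diagnosis column is greater than 0. This is only a sketch: it assumes the file has no header row (so pandas labels the columns 0 to 13), that column 13 holds the diagnosis as in the question's code, and that missing data is marked with '?'. The function name group_averages is made up for illustration.

```python
import pandas as pd


def group_averages(df, diag_col=13):
    """Mean of every column, split into healthy (False) and ill (True) rows."""
    # Boolean key: True where the diagnosis is > 0 (ill), False otherwise.
    # It is renamed so it does not share a label with the diagnosis column.
    is_ill = (df[diag_col] > 0).rename("ill")
    # mean() ignores NaN values, so missing data is left out of each average
    return df.groupby(is_ill).mean()


# Usage, assuming train.csv has no header row and uses '?' for missing data:
# df = pd.read_csv("train.csv", header=None, na_values="?")
# averages = group_averages(df)
# print(averages.round(2))       # one row for healthy, one for ill
# print(averages.mean().round(2))  # midpoint of the two rows per column
```

The result has one row indexed False (healthy) and one indexed True (ill), with one average per column.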
