
How to average the values of all columns across several .csv files, keeping a single header and the first (label) column, using Python?

So I have various .csv files in a directory of the same structure with first row as the header and first column as labels. Say file 1 is as below:

name,value1,value2,value3,value4,......
name1,10,20,0,0,...
name2,20,30,0,0,...
name3,30,40,0,0,...
name4,40,50,0,0,...
....

File2:

name,value1,value2,value3,value4,......
name1,20,30,0,0,...
name2,30,40,0,0,...
name3,40,50,0,0,...
name4,50,60,0,0,...
....

All the .csv files have the same structure with the same number of rows and columns.

What I want is something that looks like this:

name,value1,value2,value3,value4,......
name1,15,25,0,0,...
name2,25,35,0,0,...
name3,35,45,0,0,...
name4,45,55,0,0,...
....

All the value columns in the resulting file should be the average of the corresponding values in those columns across all the .csv files. So under value1 in the resulting file I should have (10+20+...+...)/n, and so on.

The number of .csv files isn't fixed, so I think I'll need a loop.

How do I achieve this with a Python script on a Linux machine?

With awk I'm doing this:

awk '
    BEGIN {FS=OFS=","}
    FNR==1 {header=$0}      # header line
    FNR>1 {
        sum[FNR,1] = $1     # names column
        for (j=2; j<=NF; j++) {
            sum[FNR,j] += $j
        }
    }
    END {
        print header
        files = ARGC - 1    # number of csv files
        for (i=2; i<=FNR; i++) {
            $1 = sum[i,1]   # another treatment for the 1st column
            for (j=2; j<=NF; j++) {
                $j = sum[i,j] / files
            }
            print
        }
    }' *.csv

But I realized that the names in the first column may not be the same in every file. Say name1 is present only in the first two files and not in the third: then I have to display a message saying it is missing in the third file, but still calculate the average from the other two files. I think using a dictionary and a counter will do it, but I am not sure how to do it.

If you want to use only the standard library, here is an example:

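import csv
from statistics import mean

filename1 = 'f1.csv'
filename2 = 'f2.csv'
output = 'output.csv'

# newline='' is what the csv module recommends for files it reads or writes;
# the output file must be opened for writing ('w'), not reading
with open(filename1, newline='') as f1, open(filename2, newline='') as f2, \
        open(output, 'w', newline='') as out:
    r1 = csv.reader(f1)
    r2 = csv.reader(f2)
    w = csv.writer(out)
    w.writerow(next(r1))  # copy the header from the first file
    next(r2)              # skip the header of the second file

    for line1, line2 in zip(r1, r2):
        # csv yields strings, so convert each pair to float before averaging
        averages = [mean(map(float, pair)) for pair in zip(line1[1:], line2[1:])]
        w.writerow([line1[0]] + averages)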
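
The two-file version above could be extended to any number of files, and to names that are missing from some of them, along the lines of the dictionary-and-counter idea from the question. The following is only a minimal sketch: the function name `average_csv_files`, the message wording, and the assumption that each name appears at most once per file are illustrative choices, not something given in the question.

```python
import csv
from collections import defaultdict


def average_csv_files(paths):
    """Average the value columns of several CSV files, keyed by the first column.

    Hypothetical helper. Assumes every file has a header row and that a
    name appears at most once per file.
    Returns (header, averages, messages).
    """
    header = None
    sums = {}                   # name -> list of per-column running sums
    counts = defaultdict(int)   # name -> number of files that contain it
    seen_in = {}                # path -> set of names found in that file

    for path in paths:
        with open(path, newline='') as f:
            reader = csv.reader(f)
            file_header = next(reader)
            if header is None:
                header = file_header        # keep only the first header
            names = set()
            for row in reader:
                if not row:                 # skip blank lines
                    continue
                name = row[0]
                values = [float(v) for v in row[1:]]
                if name not in sums:
                    sums[name] = [0.0] * len(values)
                sums[name] = [s + v for s, v in zip(sums[name], values)]
                counts[name] += 1
                names.add(name)
            seen_in[path] = names

    # One message per (file, name) pair where the name is absent from the file.
    messages = [
        f"{name} is missing in {path}"
        for path in paths
        for name in sums
        if name not in seen_in[path]
    ]
    # Divide by the number of files that actually contained the name.
    averages = {
        name: [s / counts[name] for s in total]
        for name, total in sums.items()
    }
    return header, averages, messages
```

It could be called with something like `sorted(glob.glob('*.csv'))` as `paths`; the messages can then be printed and each `[name] + averages[name]` row written with `csv.writer`, taking care that the output file is not itself picked up by the glob pattern.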

If you want to use pandas, here it is:

import pandas as pd

df1 = pd.read_csv('filename1.csv', index_col=0, header=0)
df2 = pd.read_csv('filename2.csv', index_col=0, header=0)

out = (df1 + df2) / 2  # true division, so averages are not rounded down

out.to_csv('output.csv')

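Expanding on the pandas option that was posted by Mr Morgan, you could also use:

import pandas as pd

filename_list = ['csv1.csv', 'csv2.csv']
dfs = []
for fname in filename_list:
    dfs.append(pd.read_csv(fname, index_col=0))

# Stack each frame into a Series keyed by (name, column), line the Series up
# side by side, then average across files; mean() skips the NaN left where a
# name is missing from a file.
averages = pd.concat([each.stack() for each in dfs], axis=1)\
             .mean(axis=1)\
             .unstack()

averages.to_csv("csvAvg.csv")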
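
The pandas approach above averages over whichever files contain a name, but it does so silently. If the "missing in file N" message from the question is also wanted, one possible sketch is shown below. The helper `average_with_report` and its message wording are hypothetical, and it assumes each DataFrame was read with `index_col=0` so that the names form the index.

```python
import pandas as pd


def average_with_report(frames):
    """frames: dict mapping a file name to a DataFrame indexed by the name column.

    Hypothetical helper, not part of pandas. Returns (averages, messages).
    """
    # Concatenating a dict puts its keys in the outer index level,
    # giving a (file, name) MultiIndex.
    stacked = pd.concat(frames)

    # Group by the inner level (the name) and average across files;
    # a name absent from a file simply contributes fewer rows.
    averages = stacked.groupby(level=1, sort=False).mean()

    # Compare each file's names against the full set of names seen.
    messages = []
    for fname, df in frames.items():
        for name in averages.index.difference(df.index):
            messages.append(f"{name} is missing in {fname}")
    return averages, messages
```

The `frames` dictionary could be built with, for example, `{f: pd.read_csv(f, index_col=0) for f in sorted(glob.glob('*.csv'))}`, and the result saved with `averages.to_csv(...)` as in the answers above.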


 