简体   繁体   中英

average a column of a .csv file that includes floating point numbers with bash or perl

I have a few thousand files with data as such:

bash$ cat somefile0001.csv
col1;col2;col3; ..... ;col10
2.34;0.19;6.40; ..... ;4.20
3.8;2.45;2.20; ..... ;5.09E+003   

Basically, it's a 10x301 feild .csv file that includes a header file at the top deliminated by semi-colons ( didn't include the hole thing for brevity ).

So My goal is to change the scientific notation to decimal numbers average all the columns together, and then out put the column header with the column average to a new csv file, and then to this to thousands of files.

I already have working code to parse through all the files, I just can't seem to get the part to get the averaging to work

 #!/bin/bash
 filename=csvfile.csv
 i=1
      runningsum=0
      echo ""> $filename.tmp.$i
      tmptrnfrm=$(cut -f$i -d ';' $filename)
      tmpfilehold=$filename.tmp.$i
      echo "$tmptrnfrm" >> $tmpfilehold
      trnsfrmcount=0

      for j in $(cat $tmpfilehold)
      do
           if [[ $trnsfrmcount = 0 ]]]
           then
                echo -n "Iteration $trnsfrmcount:"
                echo "$j" #>> $tmpfilehold
                trnsfrmcount=$[$trnsfrmcount+1]
           elif [[ $trnsfrmcount < 301 ]]
           then
                if [[ $(echo $j | sed 's/[0-9].[0-9][0-9]E+[0-9]/arbitrarystring/' ) == arbitrarystring ]]
                then
                     tempj=$(printf "%0f" $j)
                     runningsum=$(echo '$runningsum + $tempj' | bc)
                     echo "$j" #>> tmpfilehold
                     trnsfrmcount=$[$trnsfrmcount+1]
                else
                     echo "preruns: $runningsum"
                     runningsum=$(echo '$runningsum + $j' | bc)
                     echo "$j," #>> $tmpfilehold
                     echo "the running sum is: $runningsum"
                     trnsfrmcount=$[$trnsfrmcount+1]
                fi
           fi
      done
 totalz=$(echo '$runningsum / 300' | bc)
 echo "here is the total"
 echo "$totalz"

 exit 0

I know it's kinda messy, I put a whole lot of extra strings to stdout to see what was happening while running. I would like to do this in perl, but I am just learning and know that this can be done with bash, and also I do not have access to the CSV module and no way to install it (otherwise it might be really easy).

Any help is greatly appreciated.

Here's a basic perl script that should do what you want. I haven't tested it.

#!/usr/bin/perl 
use strict;
use warnings;

my $infile = shift;
my $outfile = shift || $infile . ".new";

my $header = "";
my $count  = 0;
my @sums   = ();
my @means  = ();

open my $fin, '<', $infile or die $!;

$header = <$fin>;
@sums = map { 0 } split ";", $header;    # to initialize @sums;

while ( my $line = <$fin> ) {
    chomp $line;

    my @fields = split ";", $line;
    for ( my $i = 0 ; $i < scalar @fields ; $i++ ) {

        # use sprintf to convert to decimal notation
        # if we think we are using scientific notation
        if ( $fields[$i] =~ m/E/i ) {
            $sums[$i] += sprintf( "%.2f", $fields[$i] );
        } else {
            $sums[$i] += $fields[$i];
        }
    }

    $count++;
}

close $fin;

exit 1 if $count == 0;

# calculate averages
@means = map { sprintf( "%.2f", $_ / $count ) } @sums;

# intentionally left out writing to a file
print $header;
print join( ";", @means ) . "\n";

Tabulator is a set of unix command line tools to work with delimited files that have header lines. Here is an example to compute the average of the first three columns:

tblred -d';' -su -c'avg1_col=avg(col1),avg_col2=avg(col2)' somefile00001.csv

produces

avg1_col;avg_col2
3.07;1.32

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM