
Average a column of a .csv file that includes floating point numbers with bash or perl

I have a few thousand files with data like this:

bash$ cat somefile0001.csv
col1;col2;col3; ..... ;col10
2.34;0.19;6.40; ..... ;4.20
3.8;2.45;2.20; ..... ;5.09E+003   

Basically, it's a 10x301 field .csv file that includes a header row at the top, delimited by semicolons (I didn't include the whole thing for brevity).

So my goal is to change the scientific notation to decimal numbers, average each column, output the column headers with the column averages to a new csv file, and then do this for thousands of files.

I already have working code to loop through all the files; I just can't seem to get the averaging part to work.

 #!/bin/bash
 filename=csvfile.csv
 i=1
      runningsum=0
      echo ""> $filename.tmp.$i
      tmptrnfrm=$(cut -f$i -d ';' $filename)
      tmpfilehold=$filename.tmp.$i
      echo "$tmptrnfrm" >> $tmpfilehold
      trnsfrmcount=0

      for j in $(cat $tmpfilehold)
      do
           if [[ $trnsfrmcount = 0 ]]]
           then
                echo -n "Iteration $trnsfrmcount:"
                echo "$j" #>> $tmpfilehold
                trnsfrmcount=$[$trnsfrmcount+1]
           elif [[ $trnsfrmcount < 301 ]]
           then
                if [[ $(echo $j | sed 's/[0-9].[0-9][0-9]E+[0-9]/arbitrarystring/' ) == arbitrarystring ]]
                then
                     tempj=$(printf "%0f" $j)
                     runningsum=$(echo '$runningsum + $tempj' | bc)
                     echo "$j" #>> tmpfilehold
                     trnsfrmcount=$[$trnsfrmcount+1]
                else
                     echo "preruns: $runningsum"
                     runningsum=$(echo '$runningsum + $j' | bc)
                     echo "$j," #>> $tmpfilehold
                     echo "the running sum is: $runningsum"
                     trnsfrmcount=$[$trnsfrmcount+1]
                fi
           fi
      done
 totalz=$(echo '$runningsum / 300' | bc)
 echo "here is the total"
 echo "$totalz"

 exit 0

I know it's kind of messy; I print a lot of extra strings to stdout to see what is happening while it runs. I would like to do this in perl, but I am just learning it and know that this can be done with bash. I also do not have access to the CSV module and have no way to install it (otherwise it might be really easy).

Any help is greatly appreciated.

Here's a basic perl script that should do what you want. I haven't tested it.

#!/usr/bin/perl 
use strict;
use warnings;

my $infile = shift;
my $outfile = shift || $infile . ".new";

my $header = "";
my $count  = 0;
my @sums   = ();
my @means  = ();

open my $fin, '<', $infile or die $!;

$header = <$fin>;
@sums = map { 0 } split ";", $header;    # to initialize @sums;

while ( my $line = <$fin> ) {
    chomp $line;

    my @fields = split ";", $line;
    for ( my $i = 0 ; $i < scalar @fields ; $i++ ) {

        # use sprintf to convert to decimal notation
        # if we think we are using scientific notation
        if ( $fields[$i] =~ m/E/i ) {
            $sums[$i] += sprintf( "%.2f", $fields[$i] );
        } else {
            $sums[$i] += $fields[$i];
        }
    }

    $count++;
}

close $fin;

exit 1 if $count == 0;

# calculate averages
@means = map { sprintf( "%.2f", $_ / $count ) } @sums;

# intentionally left out writing to a file
print $header;
print join( ";", @means ) . "\n";
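
For anyone who would rather stay in the shell, as the question does, the sprintf conversion has a direct counterpart in printf. The following is a minimal sketch for a single column, not a drop-in replacement: `column_avg` is a made-up helper name, and it assumes `bc` is installed and the file has exactly one header line. The comments mark the three points that usually break a bash/bc version: bc does not parse E notation, variables inside single quotes are never expanded, and bc divides as integers unless `scale` is set.

```shell
#!/bin/sh
# column_avg FILE COL: print the mean of column COL (1-based) of a
# semicolon-delimited FILE whose first line is a header.
# column_avg is a hypothetical helper name, not from the original post.
column_avg() {
    # tail skips the header row, cut selects the column,
    # tr drops stray spaces and carriage returns
    tail -n +2 "$1" | cut -d';' -f"$2" | tr -d ' \r' | (
        LC_ALL=C; export LC_ALL   # keep "." as the decimal separator
        sum=0
        count=0
        while IFS= read -r value; do
            [ -n "$value" ] || continue   # skip blank lines
            # printf turns 5.09E+003 into 5090.000000; bc cannot parse E notation
            value=$(printf '%f' "$value")
            # double quotes, so the shell expands the variables before bc runs
            sum=$(echo "$sum + $value" | bc)
            count=$((count + 1))
        done
        [ "$count" -gt 0 ] || exit 1      # no data rows
        # bc divides as integers unless scale is set; scale=2 keeps two
        # decimals (truncated rather than rounded)
        echo "scale=2; $sum / $count" | bc
    )
}
```

With only the two sample rows from the question in the file, `column_avg somefile0001.csv 1` prints 3.07. Calling bc once per value is slow for thousands of files, so this is mainly useful for seeing where a bash attempt goes wrong.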

Tabulator is a set of unix command line tools for working with delimited files that have header lines. Here is an example that computes the average of the first two columns:

tblred -d';' -su -c'avg1_col=avg(col1),avg_col2=avg(col2)' somefile00001.csv

produces

avg1_col;avg_col2
3.07;1.32
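
If Tabulator is not installed, plain awk can produce a similar header-plus-averages output for every column, and a small loop covers the "thousands of files" part of the question. This is only a sketch under a few assumptions: `csv_means`, `average_all`, and the `.avg.csv` suffix are names made up for illustration, each file has one header line, and every field is numeric. awk converts values such as 5.09E+003 to numbers on its own, so no separate conversion step is needed.

```shell
#!/bin/sh
# csv_means FILE: print the header row, then one row of per-column means.
# csv_means and the .avg.csv suffix are hypothetical names for illustration.
csv_means() {
    LC_ALL=C awk -F';' '
        NR == 1 { print; next }            # pass the header through unchanged
        NF == 0 { next }                   # skip blank lines
        {
            # awk converts "5.09E+003" to a number on its own
            for (i = 1; i <= NF; i++) sum[i] += $i
            if (NF > ncols) ncols = NF
            rows++
        }
        END {
            if (rows == 0) exit 1          # header only, nothing to average
            for (i = 1; i <= ncols; i++)
                printf "%.2f%s", sum[i] / rows, (i < ncols ? ";" : "\n")
        }
    ' "$1"
}

# average_all DIR: write NAME.avg.csv next to every NAME.csv in DIR.
average_all() {
    for f in "$1"/*.csv; do
        [ -e "$f" ] || continue                  # the glob matched nothing
        case $f in *.avg.csv) continue ;; esac   # do not re-average our own output
        csv_means "$f" > "${f%.csv}.avg.csv"
    done
}
```

Running `average_all /path/to/data` leaves the input files untouched and writes one two-line result file per input, which keeps reruns safe.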
