Average a column of a .csv file that includes floating point numbers with bash or perl
I have a few thousand files with data like this:
bash$ cat somefile0001.csv
col1;col2;col3; ..... ;col10
2.34;0.19;6.40; ..... ;4.20
3.8;2.45;2.20; ..... ;5.09E+003
Basically, it's a 10x301 field .csv file, delimited by semicolons, with a header line at the top (I didn't include the whole thing for brevity).
My goal is to convert the scientific notation to decimal numbers, average each column, output the column headers with the column averages to a new csv file, and then do this for thousands of files.
I already have working code to loop through all the files; I just can't get the averaging part to work:
#!/bin/bash
filename=csvfile.csv
i=1
runningsum=0
echo ""> $filename.tmp.$i
tmptrnfrm=$(cut -f$i -d ';' $filename)
tmpfilehold=$filename.tmp.$i
echo "$tmptrnfrm" >> $tmpfilehold
trnsfrmcount=0
for j in $(cat $tmpfilehold)
do
    if [[ $trnsfrmcount = 0 ]]
    then
        echo -n "Iteration $trnsfrmcount:"
        echo "$j" #>> $tmpfilehold
        trnsfrmcount=$[$trnsfrmcount+1]
    elif [[ $trnsfrmcount < 301 ]]
    then
        if [[ $(echo $j | sed 's/[0-9].[0-9][0-9]E+[0-9]/arbitrarystring/' ) == arbitrarystring ]]
        then
            tempj=$(printf "%0f" $j)
            runningsum=$(echo '$runningsum + $tempj' | bc)
            echo "$j" #>> tmpfilehold
            trnsfrmcount=$[$trnsfrmcount+1]
        else
            echo "preruns: $runningsum"
            runningsum=$(echo '$runningsum + $j' | bc)
            echo "$j," #>> $tmpfilehold
            echo "the running sum is: $runningsum"
            trnsfrmcount=$[$trnsfrmcount+1]
        fi
    fi
done
totalz=$(echo '$runningsum / 300' | bc)
echo "here is the total"
echo "$totalz"
exit 0
I know it's kind of messy; I print a lot of extra strings to stdout to see what is happening while it runs. I would like to do this in perl, but I am just learning it and know that this can be done with bash. Also, I do not have access to the CSV module and have no way to install it (otherwise it might be really easy).
Any help is greatly appreciated.
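A note on why the running sum in the script above probably never changes: bash does not expand variables inside single quotes, so bc is handed the literal text `$runningsum + $j` rather than the numbers. In addition, `<` inside `[[ ]]` compares strings, not numbers, and bc (as far as I know) does not read E notation, which is why the printf conversion is needed at all. The snippet below is only a small sketch of those three points; the sample values 1.5, 2.25 and 5 are made up for illustration.

```shell
#!/bin/bash
runningsum=1.5
j=2.25

# Single quotes: no expansion, so a pipeline would receive this literal text.
single='$runningsum + $j'
# Double quotes: the variables are expanded before the text is passed on.
double="$runningsum + $j"
echo "single-quoted: $single"
echo "double-quoted: $double"

# Inside [[ ]], < is a string comparison: "5" sorts after "301".
count=5
if [[ $count < 301 ]]; then
    string_cmp="true"
else
    string_cmp="false"
fi
# (( )) performs an integer comparison instead.
if (( count < 301 )); then
    int_cmp="true"
else
    int_cmp="false"
fi
echo "string compare 5 < 301: $string_cmp"
echo "integer compare 5 < 301: $int_cmp"

# printf's %f accepts E notation and prints plain decimal.
# LC_ALL=C keeps the decimal point a '.' regardless of locale.
decimal=$(LC_ALL=C printf '%f' '5.09E+003')
echo "decimal: $decimal"
```

With double quotes around the expression passed to bc and `(( ))` (or `-lt`) for the counter, the original loop should at least start accumulating values.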
Here's a basic perl script that should do what you want. I haven't tested it.
#!/usr/bin/perl
use strict;
use warnings;
my $infile = shift;
my $outfile = shift || $infile . ".new";
my $header = "";
my $count = 0;
my @sums = ();
my @means = ();
open my $fin, '<', $infile or die $!;
$header = <$fin>;
@sums = map { 0 } split ";", $header; # to initialize @sums;
while ( my $line = <$fin> ) {
    chomp $line;
    my @fields = split ";", $line;
    for ( my $i = 0 ; $i < scalar @fields ; $i++ ) {
        # perl converts strings such as "5.09E+003" to numbers on its own
        # in numeric context, so no explicit conversion (or rounding) is
        # needed before adding
        $sums[$i] += $fields[$i];
    }
    $count++;
}
close $fin;
exit 1 if $count == 0;
# calculate averages
@means = map { sprintf( "%.2f", $_ / $count ) } @sums;
# intentionally left out writing to a file
print $header;
print join( ";", @means ) . "\n";
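If perl modules are a problem, the same per-column average can also be sketched with awk called from bash, since awk reads values such as 5.09E+003 as numbers on its own. This is a minimal sketch, not a tested drop-in: the function name `csv_col_means`, the `somefile*.csv` pattern and the `.avg` output suffix are placeholders I made up, and it assumes one header line and `;` as the separator.

```shell
#!/bin/bash
# csv_col_means FILE: print the header line, then one line of per-column means.
# Hypothetical helper; assumes ';' as the separator and exactly one header line.
csv_col_means() {
    LC_ALL=C awk -F';' '
        NR == 1 { print; ncols = NF; next }   # pass the header through unchanged
        NF == 0 { next }                      # skip blank lines
        {
            # fields like "5.09E+003" are converted to 5090 when used as numbers
            for (i = 1; i <= ncols; i++) sum[i] += $i
            rows++
        }
        END {
            if (rows == 0) exit 1             # no data rows: nothing to average
            for (i = 1; i <= ncols; i++)
                printf "%.2f%s", sum[i] / rows, (i < ncols ? ";" : "\n")
        }
    ' "$1"
}

# Apply it to every matching file, writing FILE.avg next to each input.
# The file pattern and the .avg suffix are placeholders.
for f in somefile*.csv; do
    [ -e "$f" ] || continue                   # the pattern matched nothing
    csv_col_means "$f" > "$f.avg"
done
```

The means are divided by the number of data rows actually read rather than a hard-coded 300, so files of a different length are handled the same way.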
Tabulator is a set of unix command line tools for working with delimited files that have header lines. Here is an example that computes the average of the first two columns:
tblred -d';' -su -c'avg1_col=avg(col1),avg_col2=avg(col2)' somefile0001.csv
produces
avg1_col;avg_col2
3.07;1.32