
Aggregating a CSV file in a bash script

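I have a CSV file with multiple rows. Every row has the same number of columns. I need to group the rows by some specified columns and aggregate the data from the other columns. Example input file: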

proces1,pathA,5-May-2011,10-Sep-2017,5
proces2,pathB,6-Jun-2014,7-Jun-2015,2
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces1,pathA,11-Sep-2017,15-Oct-2017,2

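For the example above, I need to group the rows by the first two columns. From column 3 I need the minimum value, from column 4 the maximum value, and column 5 should hold the sum. So for this input file I need the output: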

proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2

I need to process this in bash (awk or sed are also fine).

Using bash and sort:

#!/bin/bash

# create associative arrays 
declare -A month2num=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6 [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)
declare -A p ds de  # date start and date end
declare -A -i sum   # set integer attribute 

# function to convert 5-Jun-2011 to 20110605
date2num() { local d m y; IFS="-" read -r d m y <<< "$1"; printf "%d%.2d%.2d\n" $y ${month2num[$m]} $d; }

# read all columns to variables p1 p2 d1 d2 s
while IFS="," read -r p1 p2 d1 d2 s; do

  # if associative array is still empty for this entry
  # fill with current strings/value
  if [[ -z ${p[$p1,$p2]} ]]; then
    p[$p1,$p2]="$p1,$p2"
    ds[$p1,$p2]="$d1"
    de[$p1,$p2]="$d2"
    sum[$p1,$p2]="$s"
    continue
  fi

  # compare strings, set new strings and sum value
  if [[ ${p[$p1,$p2]} == "$p1,$p2" ]]; then
    [[ $(date2num "$d1") < $(date2num ${ds[$p1,$p2]}) ]] && ds[$p1,$p2]="$d1"
    [[ $(date2num "$d2") > $(date2num ${de[$p1,$p2]}) ]] && de[$p1,$p2]="$d2"
    sum[$p1,$p2]+=$s
  fi

done < file

# print content of all associative arrays, using the keys from associative array p
for i in "${!p[@]}"; do echo "${p[$i]},${ds[$i]},${de[$i]},${sum[$i]}"; done

Usage: ./script.sh | sort

Output to stdout:

proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2

See: help declare, help read, and of course man bash
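The script above compares dates by first turning each one into a fixed-width YYYYMMDD string, so that the string operators < and > inside [[ ]] agree with chronological order. Below is a minimal standalone sketch of that step. The names date_key and earlier_date are hypothetical, and forcing base 10 with 10# is an addition here (not part of the script above) so that a zero-padded day such as 08 is not read as an invalid octal number.

```shell
#!/bin/bash
declare -A month2num=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6 [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)

# date_key (hypothetical name): convert D-Mon-YYYY to a fixed-width YYYYMMDD key
date_key() {
  local d m y
  IFS="-" read -r d m y <<< "$1"
  # 10# forces base 10, so a zero-padded day like 08 is accepted
  printf '%04d%02d%02d\n' "$((10#$y))" "${month2num[$m]}" "$((10#$d))"
}

# earlier_date (hypothetical name): print whichever of two dates comes first
earlier_date() {
  if [[ $(date_key "$1") < $(date_key "$2") ]]; then
    printf '%s\n' "$1"
  else
    printf '%s\n' "$2"
  fi
}

earlier_date 11-Sep-2017 5-May-2011   # prints 5-May-2011
```

Comparing the raw strings would go wrong here, because "11-Sep-2017" sorts before "5-May-2011" character by character even though it is the later date.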

With awk + sort:

awk -F',|-' '
BEGIN{
  A["Jan"]="01"
  A["Feb"]="02"
  A["Mar"]="03"
  A["Apr"]="04"
  A["May"]="05"
  A["Jun"]="06"
  A["Jul"]="07"
  A["Aug"]="08"
  A["Sep"]="09"
  A["Oct"]="10"
  A["Nov"]="11"
  A["Dec"]="12"
}
{
  B[$1","$2]=B[$1","$2]+$9
  z=sprintf( "%.2d",$3)
  y=sprintf("%s",$5 A[$4] z)
  if(!start[$1$2])
  {
    end[$1$2]=0
    start[$1$2]=99999999
  }
  if (y < start[$1$2])
  {
    start[$1$2]=y
    C[$1","$2]=$3"-"$4"-"$5
  }
  x=sprintf( "%.2d",$6)
  w=sprintf("%s",$8 A[$7] x)
  if(w > end[$1$2] )
  {
    end[$1$2]=w
    D[$1","$2]=$6"-"$7"-"$8
  }
}
END{
  for (i in B)print i "," C[i] "," D[i] "," B[i]
}
' infile | sort
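The field numbers $3 to $9 in the script above follow from the separator -F',|-', which splits each record on both commas and hyphens. The sketch below only prints the fields awk sees for one sample row; show_fields is a hypothetical helper name used for illustration.

```shell
#!/bin/bash
# show_fields (hypothetical helper): print every field of each input line
# when the field separator matches either a comma or a hyphen
show_fields() {
  awk -F',|-' '{ for (i = 1; i <= NF; i++) printf "$%d=%s\n", i, $i }'
}

printf '%s\n' 'proces1,pathA,5-May-2011,10-Sep-2017,5' | show_fields
```

For the sample row this yields nine fields: the process, the path, day/month/year of the start date, day/month/year of the end date, and the count. One caveat of this design: a process name or path that itself contains a hyphen would add fields and shift every later field number.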

An extended GNU awk solution:

awk -F, 'function parse_date(d_str){ 
             split(d_str, d, "-"); 
             t = mktime(sprintf("%d %d %d 00 00 00", d[3], m[d[2]], d[1]));
             return t 
         }
         BEGIN{ m["Jan"]=1; m["Feb"]=2; m["Mar"]=3; m["Apr"]=4; m["May"]=5; m["Jun"]=6; 
                m["Jul"]=7; m["Aug"]=8; m["Sep"]=9; m["Oct"]=10; m["Nov"]=11; m["Dec"]=12; 
         }
         { 
             k=$1 SUBSEP $2; 
             if (k in a){ 
                 if (parse_date(a[k]["min"]) > parse_date($3)) { a[k]["min"]=$3 } 
                 if (parse_date(a[k]["max"]) < parse_date($4)) { a[k]["max"]=$4 } 
             } else { 
                 a[k]["min"]=$3; a[k]["max"]=$4 
             } 
             a[k]["sum"]+= $5 
         }
         END{ 
             for (i in a) { 
                 split(i, j, SUBSEP); 
                 print j[1], j[2], a[i]["min"], a[i]["max"], a[i]["sum"] 
             } 
         }' OFS=',' file

Output:

proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
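Arrays of arrays and mktime are gawk extensions, and for (i in a) returns keys in no particular order. As a hedged alternative for other awk implementations, the sketch below keeps one flat array per aggregate and compares dates as YYYYMMDD numbers, then pipes the result through sort. The names aggregate, key, lo and hi are made up for this illustration, and the month lookup assumes three-letter English month names exactly as in the sample data.

```shell
#!/bin/bash
# aggregate (hypothetical name): group by columns 1-2, keep the earliest
# column 3, the latest column 4 and the sum of column 5; reads the files
# given as arguments, or stdin when none are given
aggregate() {
  awk -F, -v OFS=, '
    # key(): turn D-Mon-YYYY into the number YYYYMMDD; the month number is
    # derived from the position of its name in a fixed string
    function key(s,   p) {
      split(s, p, "-")
      return p[3] * 10000 + (index("JanFebMarAprMayJunJulAugSepOctNovDec", p[2]) + 2) / 3 * 100 + p[1]
    }
    {
      k = $1 OFS $2
      # sum[k] is assigned last, so (k in sum) is false on the first row of a group
      if (!(k in sum) || key($3) < key(lo[k])) lo[k] = $3
      if (!(k in sum) || key($4) > key(hi[k])) hi[k] = $4
      sum[k] += $5
    }
    END { for (k in sum) print k, lo[k], hi[k], sum[k] }
  ' "$@" | sort
}

# Run against the sample rows from the question
aggregate <<'EOF'
proces1,pathA,5-May-2011,10-Sep-2017,5
proces2,pathB,6-Jun-2014,7-Jun-2015,2
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces1,pathA,11-Sep-2017,15-Oct-2017,2
EOF
```

Usage with a file would be aggregate file. A month name that is not in the lookup string is not detected and produces a meaningless key, so real input may need validation first.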
