Using awk to get unique values from column 1, and sum corresponding values in column 2?

I have a CSV file in the following format (I was told at work that this is a "map reduce problem"): { Server1,33.23 Server2,43.46 Server3,64.34 Server4,56.89 Server2,33.24 Server1,21.40 Server2,33.46 }

It is several thousand lines long. Around 80 server names appear several times each in column 1, and column 2 is Mbs. For every occurrence of a server name in column 1, I want to add up the corresponding values in column 2, so that I am left with a new table with no duplicates in column 1 and just the total Mbs from column 2.

In case I was unclear: for every occurrence of each unique value in column 1, add the corresponding values in column 2. In the end I'd have:

Server1,TotalMbs Server2,TotalMbs Server3,TotalMbs

I know this can be done with awk, but I can't figure out how. I think it involves using the value in column 1 as a key and adding the value in column 2 to a running total, line by line. It's quite tricky. My long and inelegant solution would be to create a temp file for each server in a loop, total column 2 for each file, and then rm the files at the end, but I know it can be done in a one-liner with awk.

The following awk script might help you,

$ awk -F'[ |,]'  '{for(i=1;i<=NF;i++)if($i ~ "Server")a[$i]+=$(i+1)}END{for(i in a)printf "%s,%s ",i,a[i];printf "\n"}' input_file
Server3,64.34 Server4,56.89 Server1,54.63 Server2,110.16

If ordered output is required, set PROCINFO["sorted_in"]="@ind_str_asc" in a BEGIN block (this requires GNU awk):

$ awk -F'[ |,]'  'BEGIN{PROCINFO["sorted_in"]="@ind_str_asc"}{for(i=1;i<=NF;i++)if($i ~ "Server")a[$i]+=$(i+1)}END{for(i in a)printf "%s,%s ",i,a[i];printf "\n"}' input_file
Server1,54.63 Server2,110.16 Server3,64.34 Server4,56.89
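
PROCINFO["sorted_in"] is a GNU awk feature, so it may have no effect in other awk implementations. As a rough sketch of a portable alternative, the pairs can be printed one per line and piped through sort. The variable names data and result, the one-pair-per-line output, and the %.2f formatting are assumptions made for this illustration, not part of the answer above.

```shell
# Sample data from the question, kept on one line as shown there.
data='Server1,33.23 Server2,43.46 Server3,64.34 Server4,56.89 Server2,33.24 Server1,21.40 Server2,33.46'

# Sum per server, emit one "name,total" pair per line, then sort by name.
# i stops at NF-1 so that $(i+1) always refers to an existing field.
result=$(printf '%s\n' "$data" |
  awk -F'[ ,]' '{ for (i = 1; i < NF; i++) if ($i ~ /^Server/) a[$i] += $(i+1) }
       END { for (k in a) printf "%s,%.2f\n", k, a[k] }' |
  sort)

printf '%s\n' "$result"
```

Because the ordering is done by sort rather than by awk, this should behave the same with gawk, mawk, or a BSD awk.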

The one-liner could also be written like this:

awk -F'[ |,]' '{
  for(i=1;i<=NF;i++)
    if($i ~ "Server")
      a[$i]+=$(i+1)
} END{
  for(i in a)
    printf "%s,%s ",i,a[i];
  printf "\n"
}' input_file

Brief explanation:

  1. Set " " and "," as the delimiters.
  2. Scan each line, look for "Server" in each field, and when it is found, add the value of the next field to the corresponding key of a, i.e. a[$i]+=$(i+1).
  3. In the END block, print each key of a together with its total.

awk -F',' '{
             # add column 2 to the running total for the server in column 1
             servers[$1] += $2;
           }
           END {
             for (server in servers) {
               printf("%s %f\n", server, servers[server]);
             }
           }' input_file
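
To make the behaviour concrete, here is a small sketch that runs the same idea on the sample data from the question. It assumes the file has one Server,value record per line. The here-document, the totals variable, the comma in the output, and the %.2f format are choices made for this illustration only.

```shell
# Sample data from the question, assumed to be one record per line.
totals=$(awk -F',' '
  { servers[$1] += $2 }        # column 1 is the key, column 2 is added to its total
  END {
    for (server in servers)
      printf "%s,%.2f\n", server, servers[server]
  }' <<'EOF' | sort
Server1,33.23
Server2,43.46
Server3,64.34
Server4,56.89
Server2,33.24
Server1,21.40
Server2,33.46
EOF
)

printf '%s\n' "$totals"
```

The order in which awk walks an array with for (server in servers) is unspecified, which is why the output is piped through sort here.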

If you want to restrict the totals to specific servers, you can put a /pattern/ match in front of the first block, so that it only executes on lines that match the pattern.
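
As an illustration of such a filter, the sketch below only sums lines that start with Server2. The pattern /^Server2,/, the filtered variable, and the shortened sample input are assumptions chosen for this example.

```shell
# Only lines matching /^Server2,/ reach the accumulating block.
filtered=$(printf '%s\n' 'Server1,33.23' 'Server2,43.46' 'Server3,64.34' 'Server2,33.24' 'Server1,21.40' |
  awk -F',' '/^Server2,/ { servers[$1] += $2 }
       END { for (server in servers) printf "%s,%.2f\n", server, servers[server] }')

printf '%s\n' "$filtered"
```

Anchoring the pattern with ^ and including the trailing comma keeps it from also matching names such as Server20.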
