Use bash to cluster based on one column of a line

The input is as follows:

A  20  240
A  15  150
B  65  210
B  80  300
C  90  400
C  34  320

For each category (labelled A, B, C, ... in the first column), I'd like to find the minimum of the second column and the maximum of the third column, i.e. the widest range. The expected output is:

A  15  240
B  65  300
C  34  400

How can I do this using bash?

Using awk:

awk '
    !($1 in min) { min[$1] = $2; max[$1] = $3; next }
    {
        min[$1] = ( $2 < min[$1] ? $2 : min[$1] )
        max[$1] = ( $3 > max[$1] ? $3 : max[$1] )
    } 
    END {
        for(x in min) print x, min[x], max[x]
}' file
A 15 240
B 65 300
C 34 400

We iterate over each line and store the minimum and maximum values in two arrays keyed by the first column. In the END block we loop over the keys and print each key together with its values from both arrays.
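One detail worth knowing: awk does not specify the order in which `for (x in min)` visits the keys, so the lines may not come out sorted on every implementation. A minimal sketch of the same approach with the output piped through `sort`, assuming the sample input from the question (the `result` variable is only there for illustration):

```shell
# Feed the sample rows from the question to awk, then sort the summary lines
# so the output order does not depend on awk's array traversal order.
result=$(printf '%s\n' \
    'A  20  240' 'A  15  150' \
    'B  65  210' 'B  80  300' \
    'C  90  400' 'C  34  320' |
awk '
    !($1 in min) { min[$1] = $2; max[$1] = $3; next }  # first row of a category
    $2 < min[$1] { min[$1] = $2 }                      # smaller minimum found
    $3 > max[$1] { max[$1] = $3 }                      # larger maximum found
    END { for (x in min) print x, min[x], max[x] }
' | sort)

printf '%s\n' "$result"
```

With the sample data this prints the three lines for A, B and C in alphabetical order.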

I tried another solution, as a workaround for the side effect of unset variables in awk. (It may also be slightly more optimized.)

cat min_max

#!/bin/bash
awk '
    NF!=3 || $2 $3 ~ "[^0-9-]" {next;}           # short filter: skip malformed or non-numeric lines
    min[$1]=="" {min[$1]=$2; max[$1]=$3; next;}  # first occurrence of an ID: set min and max, read next line
    min[$1]>$2  {min[$1]=$2;}                    # later occurrences: update min if required
    max[$1]<$3  {max[$1]=$3;}                    # update max if required
    END {for(x in min)printf("%-2s %5d %5d\n", x, min[x], max[x]);}
' "$1"

cat in4

A  20  240
B  65  210
C  90  400
A  15  150
C  34  320
E  -30  -20
D   0  100
B  80  300
D  10   90
E  -20 -10

./min_max in4

A     15   240
B     65   300
C     34   400
D      0   100
E    -30   -10
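
The side effect mentioned above is presumably awk's habit of creating an array element the moment it is referenced, even inside a comparison. A small sketch of the difference between the `in` operator and a plain reference (the array name `seen` and the variables are made up for the demonstration):

```shell
# "in" tests membership without creating the element;
# merely referencing seen["A"] in a comparison creates it.
result=$(awk 'BEGIN {
    before = ("A" in seen)        # 0: nothing has been created yet
    dummy  = (seen["A"] == "")    # the reference creates seen["A"] as an empty element
    after  = ("A" in seen)        # 1: the element now exists
    print before, after
}')

printf '%s\n' "$result"
```

This prints `0 1`. In the script above the side effect does no harm, because the element is assigned a real value in the same rule that detects it as empty.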

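This pure bash script produces the same result (bash does not guarantee the iteration order of an associative array, so the lines may appear in a different order):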

cat min_max2

#!/bin/bash
(($#!=1))&& { echo "Usage: $0 inpfile"; exit 1; }
declare -A min max                                   # define associative arrays
while read -r id mn mx; do
   [[ ${min[$id]+any} == "" ]] && { min[$id]=$mn; max[$id]=$mx; continue; } # parameter expansion: empty if the key is unset
   (( ${min[$id]} > $mn )) && min[$id]=$mn
   (( ${max[$id]} < $mx )) && max[$id]=$mx
done <"$1"
for i in "${!min[@]}"; do printf "%-2s %5d %5d\n" "$i" "${min[$i]}" "${max[$i]}"; done
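
The `${min[$id]+any}` test relies on the `${parameter+word}` expansion, which yields `word` when the parameter is set (even to an empty string) and nothing when it is unset. A minimal sketch with plain variables, which behaves the same way for array elements in bash (the variable names are illustrative only):

```shell
unset never_set          # make sure the variable does not exist
empty_value=""           # set, but empty

a=${never_set+any}       # expands to nothing: the variable is unset
b=${empty_value+any}     # expands to "any": the variable is set, though empty
result="[$a] [$b]"

printf '%s\n' "$result"
```

This prints `[] [any]`, which is why the script can tell a key it has never seen apart from one that already holds a value.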
