
Count occurrences in a column and get the whole line

Trio type1 Chr Pos Allele rsID Gene
Sample11 type1 1 11105106 C/T rs2273345 MASP2
Sample11 type1 1 31342388 A/C/* 1:31342388 SDC3
Sample11 type1 1 33402334 GA/G-/++A rs36040674 RNF19B
Sample11 type1 1 38078171 G/+GT/+GTGT rs139353088 RSPO1
Sample11 type1 1 47074774 TCATGGTCTGATGGTCC/T----------------/ACATGGTCTGATGGTCC rs4275405 MOB3C
Sample11 type1 1 50883804 CTT/C--/CT- 1:50883804 DMRTA2
Sample11 type1 1 52947350 TA/++A/T- 1:52947350 ZCCHC11
Sample11 type1 1 84956161 CT/C-/++T rs556742567 RPF1
Sample11 type1 1 114940632 CAA/C--/CA- rs78184484 TRIM33

I know how to count the occurrences of each value in the rsID column. Here is the command I learned from @glenn jackman, which gives me the count for each rsID.

awk '{count[$7]++} END {for (word in count) print word, count[word]}' Nofilter.txt

I would like to print every whole line whose rsID is recurrent, something like:

grep if count[word]>3 

How should I modify the current command to do that?

Assuming:

  • You are using gawk, and
  • Out-of-order output is OK, and
  • Your input is named data.txt

Solutions:

  • Since gawk 4.0.0:

     awk '{a[$7]++;b[$7][c++]=$0}END{for(x in a)if(a[x]>3)for(y in b[x])print(b[x][y])}' data.txt 
  • Before gawk 4.0.0:

     awk '($7 in a){b[$7]=ORS}{c[$7]++;a[$7]=a[$7] b[$7] $0}END{for(x in c)if(c[x]>3)print(a[x])}' data.txt 
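
If input order matters, one portable alternative is to read the file twice: count on the first pass and print on the second. What follows is a minimal sketch rather than part of the answer above. It assumes the input is a regular file that can be read twice, and the function name recurrent_lines and its optional arguments are hypothetical. The key defaults to field 7, as in the commands above; in the sample shown in the question rsID is the sixth whitespace-separated field, so the field number may need adjusting for your file.

```shell
# recurrent_lines FILE [FIELD] [MIN]   (hypothetical helper, not from the original answer)
# Prints, in input order, every line whose FIELD value occurs more than MIN times.
# Defaults: FIELD=7 and MIN=3, matching the commands above.
recurrent_lines() {
    awk -v col="${2:-7}" -v min="${3:-3}" '
        NR == FNR { count[$col]++; next }   # first read of the file: count each key
        count[$col] > min                   # second read: print lines with a recurrent key
    ' "$1" "$1"
}

# Usage: recurrent_lines data.txt
```

Because nothing is buffered per key, memory use grows with the number of distinct keys rather than with the size of the file.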

Why not just pipe it to awk to check if the occurrences are greater than 3?

awk '{count[$7]++} END {for (word in count) print word, count[word]}' Nofilter.txt | awk '$2 > 3'

This filters the count output and shows which values occurred more often than your chosen cutoff. Note that it prints each value with its count, not the original lines.
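
To get from those counts back to the whole lines, the list of recurrent keys can be fed into one more awk that re-reads the original file. The following is only a sketch under the same assumptions as the pipeline above (key in field 7, cutoff of 3); the function name lines_for_recurrent_keys is hypothetical.

```shell
# lines_for_recurrent_keys FILE   (hypothetical helper)
# Step 1: count each key.  Step 2: keep only keys seen more than 3 times.
# Step 3: read that key list from stdin ("-"), then print the lines of FILE
#         whose field 7 is one of those keys, in input order.
lines_for_recurrent_keys() {
    awk '{count[$7]++} END {for (word in count) print word, count[word]}' "$1" |
        awk '$2 > 3 {print $1}' |
        awk 'NR == FNR {keep[$1]; next} $7 in keep' - "$1"
}

# Usage: lines_for_recurrent_keys Nofilter.txt
```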

If your awk doesn't support true multidimensional arrays (arrays of arrays), you can try this:

$ awk '{k=$7; c[k]++; a[k]=(k in a)?a[k] ORS $0:$0} 
    END{for(k in c) if(c[k]>3) print a[k]}' file

Explanation

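k=$7 sets the key to field 7.
c[k]++ increments the counter for that key.
a[k]=(k in a)?a[k] ORS $0:$0 appends the current row to the rows already stored for the key, joined by the record separator. The first row for a key, when !(k in a), is stored as is, because a[k]=a[k] ORS $0 would start the stored text with an extra record separator.
END{... runs once all input has been read and prints the stored rows of every key whose count satisfies the condition.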
