
How to extract the IDs and keys using a Linux command?

I have a list of IDs in the first column, each matched with a key value in the second column. I want to remove the duplicate IDs and keep the corresponding values as a comma- or colon-separated list, as shown in the output format below.

Input file:

TRINITY_DN728479_c0_g1_i1   GO:0003674
TRINITY_DN728479_c0_g1_i1   GO:0003824
TRINITY_DN728479_c0_g1_i1   GO:0003887
TRINITY_DN728480_c0_g1_i1   GO:0003891
TRINITY_DN728480_c0_g1_i1   GO:0003892

I want this output:

TRINITY_DN728479_c0_g1_i1        GO:0003674, GO:0003824, GO:0003887
TRINITY_DN728480_c0_g1_i1        GO:0003891,GO:0003892

I have tried awk, but it is not working:

awk -vORS=, '{ print $2 }' Gene.GO | sed 's/,$/\n/'
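The attempt above prints only column 2 and replaces every newline with a comma, so the IDs are dropped and all groups run together on one line. A minimal check on the sample data, assuming it is written to a temporary file (the variable names `input` and `attempt_out` are only for illustration, and the sed expression is simplified to strip the trailing comma):

```shell
# Sample data from the question, written to a temporary file.
input=$(mktemp)
printf '%s\n' \
  'TRINITY_DN728479_c0_g1_i1   GO:0003674' \
  'TRINITY_DN728479_c0_g1_i1   GO:0003824' \
  'TRINITY_DN728479_c0_g1_i1   GO:0003887' \
  'TRINITY_DN728480_c0_g1_i1   GO:0003891' \
  'TRINITY_DN728480_c0_g1_i1   GO:0003892' > "$input"

# Column 1 is never printed, and ORS=, replaces every newline,
# so all keys land on a single line regardless of their ID.
attempt_out=$(awk -v ORS=, '{ print $2 }' "$input" | sed 's/,$//')
printf '%s\n' "$attempt_out"
rm -f "$input"
```

The result is a single line holding all five GO terms and no IDs, which is why the keys have to be accumulated per ID instead.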

1st solution: With your shown samples, please try the following awk code. If your 1st field is NOT sorted, run sort first and pipe its output into the awk code.

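sort -k1,1 Input_file |
awk '
  BEGIN{ s1=","; OFS="\t" }
  prev!=$1 && prev!=""{      # 1st field changed: print the finished group
    print prev,value
    value=""
  }
  {
    value=(value!="" ? value s1 : "")$2   # append the current key
    prev=$1
  }
  END{
    if(prev!="" && value!=""){             # print the last group
      print prev,value
    }
  }
'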

2nd solution: An awk-only solution. This gives the output in the same order in which the 1st field appears in Input_file.

awk '
BEGIN{ s1=","; OFS="\t" }
!arr1[$1]++{
  arr2[++count]=$1
}
{
  value[$1]=($1 in value ? value[$1] s1: "")$2
}
END{
  for(i=1;i<=count;i++){
    print arr2[i],value[arr2[i]]
  }
}
' Input_file

3rd solution: If you are not concerned about the order of the 1st field in the output, try the following. Note that awk does not guarantee the iteration order of for(i in value).

awk '
BEGIN{ s1=",";OFS="\t" }
{
  value[$1]=($1 in value ? value[$1] s1: "")$2
}
END{
  for(i in value){
    print i, value[i]
  }
}
'  Input_file

If the input has 2 columns and is already grouped by column 1:

awk '
{
  printf "%s", ($1==p ? "," $2 : ors $0)
  ors = ORS
  p = $1
} END {printf "%s", ors}' file
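The one-pass approach above relies on every ID appearing in one contiguous run. A rough way to check that before using it, sketched here with a hypothetical helper `is_grouped` and made-up two-column test files (none of these names come from the original answer):

```shell
# is_grouped FILE: exit status 0 if each value of column 1 forms a
# single contiguous run in FILE, 1 if some value reappears later.
is_grouped() {
  awk '
    $1 != p { if (seen[$1]++) { bad = 1; exit }  # ID seen in an earlier run
              p = $1 }
    END { exit bad + 0 }
  ' "$1"
}

# Hypothetical test files: one grouped, one with ID "a" split in two runs.
grouped=$(mktemp); ungrouped=$(mktemp)
printf '%s\n' 'a GO:1' 'a GO:2' 'b GO:3' > "$grouped"
printf '%s\n' 'a GO:1' 'b GO:3' 'a GO:2' > "$ungrouped"

if is_grouped "$grouped"; then grouped_status=yes; else grouped_status=no; fi
if is_grouped "$ungrouped"; then ungrouped_status=yes; else ungrouped_status=no; fi
echo "grouped file: $grouped_status, ungrouped file: $ungrouped_status"
rm -f "$grouped" "$ungrouped"
```

If the check fails, sorting the file on column 1 first, or using one of the array-based solutions, avoids split groups.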

With datamash:

$ datamash -W -g1 collapse 2 <ip.txt 
TRINITY_DN728479_c0_g1_i1   GO:0003674,GO:0003824,GO:0003887
TRINITY_DN728480_c0_g1_i1   GO:0003891,GO:0003892
  • -W: use whitespace (spaces/tabs) as the field separator
  • -g1: group by column 1
  • collapse 2: collect all values in column 2 for each column 1 key

If the input is not sorted, use the -s option or pipe the input from the sort command. The output field delimiter is a tab here; you can change it with the --output-delimiter option.
