I have a list of IDs in the first column, each paired with a value in the second column. I want to remove the duplicate IDs and keep the corresponding values on one line, separated by commas or colons, as shown in the output format below.
Input file:
TRINITY_DN728479_c0_g1_i1 GO:0003674
TRINITY_DN728479_c0_g1_i1 GO:0003824
TRINITY_DN728479_c0_g1_i1 GO:0003887
TRINITY_DN728480_c0_g1_i1 GO:0003891
TRINITY_DN728480_c0_g1_i1 GO:0003892
Desired output:
TRINITY_DN728479_c0_g1_i1 GO:0003674, GO:0003824, GO:0003887
TRINITY_DN728480_c0_g1_i1 GO:0003891,GO:0003892
I have tried awk, but it is not working:
awk -vORS=, '{ print $2 }' Gene.GO | sed 's/,$/\n/'
1st solution: With the samples shown, try the following awk code. It needs equal values of the 1st field to be on adjacent lines, so if your input is NOT sorted, pipe it through sort first, as done here.
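sort -t_ -k1 -k2 Input_file |
awk '
BEGIN{ s1=","; OFS="\t" }
# Key changed: print the finished group and reset the value list.
prev!=$1 && prev!=""{
print prev,value
value=""
}
{
# Append $2, adding the separator only when the list is not empty.
value=(value!="" ? value s1 : "")$2
prev=$1
}
END{
# Print the last group.
if(prev!="" && value!=""){
print prev,value
}
}
'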
2nd solution: An awk-only solution. It prints the output in the same order in which the 1st field first appears in Input_file.
awk '
BEGIN{ s1=","; OFS="\t" }
!arr1[$1]++{
arr2[++count]=$1
}
{
value[$1]=($1 in value ? value[$1] s1: "")$2
}
END{
for(i=1;i<=count;i++){
print arr2[i],value[arr2[i]]
}
}
' Input_file
3rd solution: If you are not worried about the order of the 1st field in the output, try the following.
awk '
BEGIN{ s1=",";OFS="\t" }
{
value[$1]=($1 in value ? value[$1] s1: "")$2
}
END{
for(i in value){
print i, value[i]
}
}
' Input_file
If the input has 2 columns and is already grouped by column 1:
awk '
{
printf "%s", ($1==p ? "," $2 : ors $0)
ors = ORS
p = $1
} END {printf "%s", ors}' file
With datamash:
$ datamash -W -g1 collapse 2 <ip.txt
TRINITY_DN728479_c0_g1_i1 GO:0003674,GO:0003824,GO:0003887
TRINITY_DN728480_c0_g1_i1 GO:0003891,GO:0003892
-W uses space/tab as the field separator
-g1 groups by column 1
collapse 2 collects all values in column 2 for each column 1 key
If the input is not sorted, use the -s option or pipe the input from the sort command. The output field delimiter is a tab here; you can change it with the --output-delimiter option.
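The -s and --output-delimiter options mentioned above can be combined as in the following sketch. The function name collapse_with_datamash is hypothetical, and the sketch assumes that datamash is installed and that your version supports --output-delimiter.

```shell
# collapse_with_datamash FILE
# Hypothetical wrapper: handles unsorted input via -s and prints a single
# space (instead of the default tab) between the key and the collapsed values.
collapse_with_datamash() {
  datamash -s -W --output-delimiter=' ' -g1 collapse 2 < "$1"
}

# Usage (assumes the input file is named ip.txt, as in the answer):
#   collapse_with_datamash ip.txt
```

Because -s sorts by the group column, the keys come out in sorted order rather than in input order.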