
Match lines based on patterns and reformat file Bash/ Linux

I am looking preferably for a bash/Linux method for the problem below.

I have a text file (input.txt) that looks like this, with many more lines:

TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34    CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22    CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11    EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23    CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06    EN_DavaW
index_07_barcode_04_PA-17-ACW-04        17-ACW
index_09_barcode_05_PA-17-ACW-05        17-ACW
index_08_barcode_37_PA-21-YC-15         21-YC
index_09_barcode_04_PA-22-GB-10         22-GB
index_10_barcode_37_PA-28-CC-17         28-CC
index_11_barcode_29_PA-32-MW-07         32-MW
index_11_barcode_20_PA-32-MW-08         32-MW

I want to produce a file that looks like this:

CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)

I thought that I could do something along the lines of this.

cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??

But I only know how to grep for one pattern at a time. Is there a way to find all the matching lines at once and output them in this format?

Thank you! (Happy Easter/ long weekend to all!)

With your shown samples, please try the following.

awk '
FNR==NR{
  arr[$2]=(arr[$2]?arr[$2]",":"")$1
  next
}
($2 in arr){
  print $2"("arr[$2]")"
  delete arr[$2]
}
' Input_file Input_file

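2nd solution: With a single read of Input_file, try the following. Note that for (i in arr) visits the keys in an unspecified order, so the output lines may not follow the order of the input.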

awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file

Explanation (1st solution): Here is a detailed explanation of the 1st solution.

awk '                      ##Start the awk program.
FNR==NR{                   ##FNR==NR is true only while Input_file is being read the first time.
  arr[$2]=(arr[$2]?arr[$2]",":"")$1 ##Append the 1st field to arr[$2], comma-separated, keyed by the 2nd field.
  next                     ##Skip the remaining statements for this line.
}
($2 in arr){               ##On the second read, check whether the 2nd field is still a key in arr.
  print $2"("arr[$2]")"    ##Print the 2nd field followed by its collected values in parentheses.
  delete arr[$2]           ##Delete the entry so that each key is printed only once.
}
' Input_file Input_file    ##Pass Input_file twice so that it is read two times.

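This assumes your input is grouped by the $2 value; if it isn't, run sort -k2,2 on it first. The script makes one pass, stores only one token at a time in memory, and produces output in the same order of $2 values as the input. Note that the sample input as posted is not fully grouped (the CC_LlanR and EN_DavaW lines are interleaved), which is why those two keys each appear twice in the output below: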

$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
    printf "%s%s(", ORS, $2
    ORS = ")\n"
    sep = ""
    prev = $2
}
{
    printf "%s%s", sep, $1
    sep = ","
}
END { print "" }

$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
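The split CC_LlanR and EN_DavaW groups in the output above go away once the input is sorted on the second column before it reaches the script. The following is a sketch of that pipeline, with the awk program inlined so the example is self-contained. The function name group_sorted is hypothetical. The -b option is added here on the assumption that the columns may be padded with a varying number of spaces, and -s (a stable sort, available in GNU and BSD sort but not required by POSIX) keeps names within a group in their input order.

```shell
# group_sorted is a hypothetical wrapper name used only for this sketch.
# It sorts on column 2 and then streams the grouped lines through awk.
group_sorted() {
  # -b: ignore leading blanks when locating the key
  # -s: stable sort, so equal keys keep their input order (GNU/BSD extension)
  sort -s -b -k2,2 "$@" | awk '
    BEGIN { ORS = "" }
    $2 != prev {                      # a new key starts a new output line
      printf "%s%s(", ORS, $2         # close the previous group, open this one
      ORS = ")\n"
      sep = ""
      prev = $2
    }
    { printf "%s%s", sep, $1; sep = "," }
    END { print "" }                  # emit the final ")" and newline, if any group was opened
  '
}

# Small demonstration with made-up, unevenly padded data.
printf '%s\n' 'a1 X' 'b1    Y' 'a2   X' | group_sorted
```

The trade-off is that the output now follows the sorted order of the keys rather than their order in the original file.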

This might work for you (GNU sed):

sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
        x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file

Rewrite each line as key(name) and append it to the hold space.

Before moving on to the next line, accumulate entries with the same key into a single line.

Delete every line except the last.

Replace the last line with the contents of the hold space.

Remove the first character (a newline artefact introduced by the H command) and print the result.

NB: The output is not sorted; keys appear in the order in which they first occur in the input.
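The steps above can be lined up against the individual commands by writing the same script one command per line. This is a sketch that assumes GNU sed (for -E together with \S and \s); the wrapper name group_with_sed is made up for illustration.

```shell
# group_with_sed is a hypothetical wrapper name used only for this sketch.
# Requires GNU sed.
group_with_sed() {
  sed -E '
    # Rewrite "name   key" as "key(name)".
    s/^(\S+)\s+(\S+)/\2(\1)/
    # Append the rewritten line to the hold space.
    H
    # Swap in the hold space and fold the new entry into an earlier one with the same key.
    x
    s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/
    x
    # Print nothing until the last line has been read.
    $!d
    # On the last line, fetch the hold space and drop its leading newline.
    x
    s/.//
  ' "$@"
}

# Small demonstration with made-up data.
printf '%s\n' 'a1 X' 'b1 Y' 'a2 X' | group_with_sed
```

Because the whole result is built up in the hold space, this approach keeps all output lines in memory until the end of the input.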
