
How do I know which delimiter has occurred first using awk in bash?

How can I find out which delimiter occurs first in each line, using a single awk command?

Assume I have a file with the following contents:

AB BC DE
BC DE AB
DE BC AB

I want to know which of the three strings DE, AB and BC occurs first in each line.

I thought that I could use AB as the delimiter and take the first field, then do the same with BC, and then with DE.

This can be done by:

$ awk -F'AB' '{print $1}' file \
  | awk -F'BC' '{print $1}' \
  | awk -F'DE' '{print $1}'

However, is there a way to change the delimiter dynamically inside the awk program, so that the same thing can be done with a single awk invocation?

Edit: Corrected earlier mistakes.

If this isn't what you want:

awk 'match($0,/AB|BC|DE/){print substr($0,RSTART,RLENGTH)}' file

then edit your question to clarify your requirements and provide concise, testable sample input and expected output.
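
As a quick check, here is a minimal sketch that runs the same match() call over the three sample lines from the question. The wrapper function name first_token and the NR prefix in the output are additions for illustration only, not part of the one-liner above.

```shell
# Hypothetical wrapper around the match() one-liner; reads lines on stdin.
first_token() {
    # match() sets RSTART/RLENGTH to the leftmost match of the regex,
    # so substr() extracts whichever code appears first on the line.
    awk 'match($0, /AB|BC|DE/) { print NR ": " substr($0, RSTART, RLENGTH) }'
}

# Sample input from the question, fed through a here-document.
first_token <<'EOF'
AB BC DE
BC DE AB
DE BC AB
EOF
```

For this input it prints 1: AB, 2: BC and 3: DE on separate lines. Lines containing none of the codes produce no output, because match() returns 0 and the action is skipped.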

First of all, if your file contains nothing but the codes AB, BC and DE (plus whitespace and newlines), then the answer is straightforward:

awk '{print $1}' file

This is consistent with your example. Nonetheless, I do not believe this is the general case. The solution by Ed Morton is clearly the way forward: it is clean, simple and, on top of that, a one-liner.

However, from a purely educational perspective, a different awk approach is presented here.

If you want to find the "first" separator in a line, you could attack the problem from a different angle. Instead of interpreting the line as a set of columns, you could understand it as a set of records. This turns the question into "which record separator was found first?":

RT (gawk extension): the input text that matched the text denoted by RS, the record separator. It is set every time a record is read.

For a single line of characters, you could do something like this:

$ echo "AB BC DE BC DE AB DE BC AB" \
   | awk 'BEGIN{RS="DE|AB|BC"}{print RT;exit }' 
AB

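Now it is possible to play with the idea a bit more by toggling RS between a newline and the requested set. The first record of each line ends at the first code, so RT holds that code; switching RS to a newline then makes the rest of the line a single record, which is skipped. This is just to show how flexible awk is.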

$ awk 'BEGIN{RSSET="DE|AB|BC";RS=RSSET}
       (RS=="\n"){RS=RSSET;next}
       {print RT; RS="\n"; next}' file

If file is

AB BC DE BC DE AB DE BC AB
BC DE AB DE BC AB
DE AB DE BC AB

it outputs

AB
BC
DE
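
Since RT is a gawk extension, the toggling trick above will not work in other awk implementations. As a rough portable sketch, the same per-line result can be reached by comparing index() positions instead. Note that this is a different technique from the RS/RT approach, and that the function name earliest_token and the variable name tokens are made up for this illustration. The codes are treated as literal strings, not regular expressions.

```shell
# Hypothetical helper: print, for each input line, the code that starts earliest.
earliest_token() {
    # tokens is an illustrative variable: a space-separated list of literal codes.
    awk -v tokens='AB BC DE' '
    BEGIN { n = split(tokens, tok, " ") }
    {
        best = 0; found = ""
        for (i = 1; i <= n; i++) {
            pos = index($0, tok[i])          # 0 means "not present"
            if (pos > 0 && (best == 0 || pos < best)) {
                best = pos; found = tok[i]   # remember the leftmost code so far
            }
        }
        if (best > 0) print found            # skip lines without any code
    }'
}

printf 'AB BC DE BC DE AB DE BC AB\nBC DE AB DE BC AB\nDE AB DE BC AB\n' | earliest_token
```

For the three-line file shown above this prints AB, BC and DE, matching the gawk output.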

Here is a sed solution, since the question was tagged sed. The greedy nature of sed's matching made this a tad more confusing, but I think the following works.

#!/usr/bin/sed -rnf

# This presumes you only want to print matching rows.
/(AB|BC|DE)/ {
    # print the line number
    =;
    # find the first match, then remove the rest of the line
    s/(AB|BC|DE).*$/\1/;
    # this leaves only one possible match, so the greedy .* at the
    # start cannot skip past the match we want.
    s/^.*(AB|BC|DE)/\1/;
    # so print.
    p
}

As an example, I've changed the codes to check that the first one on each line is the one being matched:

~$> printf "%b\n" "$letters"
ABa BBa ABb BBb ABc BBc
BBc ABc BBb ABb BBa ABa
ABb ABc BBa BBc
not right

~$> echo "$letters" | sed -rn '/(AB.|BB.)/ {=; s/(AB.|BB.).*$/\1/; s/^.*(AB.|BB.)/ \1/; p }'
1
 ABa
2
 BBc
3
 ABb
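
For completeness, here is a sketch of the same two-substitution idea with the question's codes AB, BC and DE substituted in. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do); the function name first_code and the test lines are invented for this illustration.

```shell
# Hypothetical wrapper: print the line number, then the first code on that line.
first_code() {
    sed -En '/(AB|BC|DE)/ {
        =
        # keep everything up to and including the leftmost code
        s/(AB|BC|DE).*$/\1/
        # only one code is left, so the greedy prefix can be dropped safely
        s/^.*(AB|BC|DE)/\1/
        p
    }'
}

printf 'AB BC DE\nBC DE AB\nno codes here\nxx DE BC AB\n' | first_code
```

This prints 1, AB, 2, BC, 4, DE on separate lines; line 3 is skipped because it contains none of the codes.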


 