
Read line by line and print matches line by line

I am new to shell scripting, it would be great if I can get some help with the question below.

I want to read a text file line by line, and print all matched patterns in that line to a line in a new text file.

For example:

$ cat input.txt

SYSTEM ERROR: EU-1C0A  Report error -- SYSTEM ERROR: TM-0401 DEFAULT Test error
SYSTEM ERROR: MG-7688 DEFAULT error -- SYSTEM ERROR: DN-0A00 Error while getting object -- ERROR: DN-0A52 DEFAULT Error -- ERROR: MG-3218 error occured in HSSL
SYSTEM ERROR: DN-0A00 Error while getting object -- ERROR: DN-0A52 DEFAULT Error
SYSTEM ERROR: EU-1C0A  error Failed to fill in test report -- ERROR: MG-7688

The intended output is as follows:

$ cat output.txt

EU-1C0A TM-0401
MG-7688 DN-0A00 DN-0A52 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688

I tried the following code:

while read p; do
    grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs
done < input.txt > output.txt

which produced this output:

EU-1C0A TM-0401 MG-7688 DN-0A00 DN-0A52 MG-3218 DN-0A00 DN-0A52 EU-1C0A MG-7688 .......

Then I also tried this:

while read p; do
    grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs > output.txt
done < input.txt

But did not help :(

Maybe there is another way, I am open to awk/sed/cut or whatever... :)

Note: There can be any number of error codes (i.e. XX-XXXX, the pattern of interest) in a single line.

% awk 'BEGIN{RS=": "};NR>1{printf "%s%s", $1, ($0~/\n/)?"\n":" "}' input.txt 
EU-1C0A TM-0401
MG-7688 DN-0A00 DN-0A52 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688

Explanation in longform:

awk '
    BEGIN{ RS=": " } # Set the record separator to colon-space
    NR>1 {           # Ignore the first record
        printf("%s%s", # Print two strings:
            $1,      # 1. first field of the record (`$1`)
            ($0~/\n/) ? "\n" : " ")
                     # Ternary expression, read as `if condition (thing
                     # between brackets), then thing after `?`, otherwise
                     # thing after `:`.
                     # So: If the record ($0) matches (`~`) newline (`\n`),
                     # then put a newline. Otherwise, put a space.
    }
' input.txt 

Previous answer to the unmodified question:

% awk 'BEGIN{RS=": "};NR>1{printf "%s%s", $1, (NR%2==1)?"\n":" "}' input.txt 
EU-1C0A TM-0401
MG-7688 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688

edit: With a safeguard against ": " injection (thanks @e0k). It tests that the first field after the record separator looks the way we expect it to.

awk 'BEGIN{RS=": "};NR>1 && $1 ~ /^[A-Z]{2}-[A-Z0-9]{4}$/ {printf "%s%s", $1, ($0~/\n/)?"\n":" "}' input.txt

There's always perl! And this will grab any number of matches per line.

perl -nle '@matches = /[A-Z]{2}-[A-Z0-9]{4}/g; print(join(" ", @matches)) if (scalar @matches);' < input.txt > output.txt

-e supplies the Perl code to run, -n runs it once per input line, and -l automatically chomps each line and appends a newline to every print.

The regex implicitly matches against $_ . So @matches = $_ =~ //g is overly verbose.

If there is no match, this will not print anything.

You could always keep it extremely simple:

$ awk '{o=""; for (i=1;i<=NF;i++) if ($i=="ERROR:") o=o$(i+1)" "; print o}' input.txt
EU-1C0A TM-0401
MG-7688 DN-0A00 DN-0A52 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688

The above will add a blank char to the end of each line, trivially avoided if you care...
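
One possible way to avoid that trailing blank, as a minimal sketch rather than the answer's own code: keep a separator variable (called sep here, a name chosen only for illustration) that is empty before the first code and a space afterwards. The sample line is taken from the question's input.

```shell
# Sample line from the question's input.txt
line='SYSTEM ERROR: EU-1C0A  error Failed to fill in test report -- ERROR: MG-7688'

# sep is empty before the first code and a space after it, so no trailing blank is added.
# The loop stops at NF-1 so that $(i+1) always refers to an existing field.
out=$(printf '%s\n' "$line" | awk '{
    o = ""; sep = ""
    for (i = 1; i < NF; i++)
        if ($i == "ERROR:") { o = o sep $(i+1); sep = " " }
    print o
}')

# Brackets make any stray trailing blank visible
printf '[%s]\n' "$out"
```

The brackets in the final printf are only there to show that nothing follows the last code.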

To keep your grep pattern, here's a way:

while IFS='' read -r p; do
    echo $(grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' <<<"$p")
done < input.txt > output.txt
  • while IFS='' read -r p; do is the standard way to read line-by-line into a variable. See, eg, this answer .
  • grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' <<<"$p" runs your grep and prints the matches. The <<<"$p" is a "here string" that provides the string $p (the line that was read in) as stdin to grep . This means grep will search the contents of $p and print each match on its own line.
  • echo $(grep ...) converts the newlines in grep 's output to spaces, and adds a newline at the end. Since this loop happens for each line, the result is to print each input line's matches on a single line of the output.
  • done < input.txt > output.txt is correct: you are providing input to, and taking output from, the loop as a whole. You don't need redirection within the loop.
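
The newline-to-space step can be seen in isolation with a small sketch. It uses a pipe instead of the here string, on the assumption that it should also run in a plain sh; the variable names matches and joined are made up for the illustration.

```shell
# First sample line from the question
line='SYSTEM ERROR: EU-1C0A  Report error -- SYSTEM ERROR: TM-0401 DEFAULT Test error'

# grep -o prints each match on its own line
matches=$(printf '%s\n' "$line" | grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}')
printf '%s\n' "$matches"

# Left unquoted, the expansion is word-split, so echo joins the matches with single spaces
joined=$(echo $matches)
printf '%s\n' "$joined"
```

Quoting the expansion ("$matches") keeps the newlines; leaving it unquoted is what turns them into spaces.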

If you know that every line contains exactly two instances of the string you want to match, here is another solution that works:

cat input.txt | grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs -L2 > output.txt

Here is a solution with awk that is fairly straightforward, but it is not an elegant one-liner (as many awk solutions tend to be). It should work with any number of your error codes per line, and with an error code defined as a field (white space separated word) that matches a given regex. Since it's not a snazzy one-liner, I stored the program in a file:

codes.awk

#!/usr/bin/awk -f
{
    m=0;
    for (i=1; i<=NF; ++i) {
        if ( $i ~ /^[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]$/ ) {
            if (m>0) printf "%s", OFS
            printf "%s", $i   # explicit format, so a "%" in the data is not read as a format
            m++
        }
    }
    if (m>0) printf "%s", ORS
}

You would run this like

$ awk -f codes.awk input.txt

I hope you find it fairly easy to read. It runs the block once for each line of input. It iterates over each field and checks if it matches a regular expression, then prints the field if it does. The variable m keeps track of the number of matched fields on the current line so far. The purpose of this is to print the output field separator OFS (a space by default) between the matched fields only as needed and to use the output record separator ORS (a newline by default) only if there was at least one error code found. This prevents unnecessary white space.

Notice that I have changed your regular expression from [A-Z]{2}-[A-Z0-9]{4} to [A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9] . This is because old awk will not (or at least may not ) support interval expressions (the {n} parts). You could use [A-Z]{2}-[A-Z0-9]{4} with gawk , however. You can tweak the regex as needed. (In both awk and gawk, regular expressions are delimited by / .)

The regex /[A-Z]{2}-[A-Z0-9]{4}/ would match any field that contains your XX-XXXX pattern of letters and digits. You want the field to be a full match to the regex and not just include something that matches that pattern. To do this, the ^ and $ mark the beginning and end of the string. For example, /^[A-Z]{2}-[A-Z0-9]{4}$/ (with gawk) would match US-BOTZ , but not USA-ROBOTS . Without the ^ and $ , USA-ROBOTS would match because it includes a substring SA-ROBO that does match the regex.
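
A quick way to check the effect of the anchors, sketched with the two example words from the paragraph above. The variable names (anchored, unanchored, full, part) are illustrative only, and the regexes are passed to awk as strings with -v so the same program can be reused for both.

```shell
# Same pattern with and without the ^ ... $ anchors (interval-free form for old awks)
anchored='^[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]$'
unanchored='[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]'

# Print only the input lines that match the regex held in re
full=$(printf '%s\n' US-BOTZ USA-ROBOTS | awk -v re="$anchored" '$0 ~ re')
part=$(printf '%s\n' US-BOTZ USA-ROBOTS | awk -v re="$unanchored" '$0 ~ re')

printf 'anchored matches:\n%s\n' "$full"
printf 'unanchored matches:\n%s\n' "$part"
```

Only US-BOTZ passes the anchored test, while both words pass the unanchored one.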

Parsing grep -n with AWK

grep -n -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' file | awk -F: -vi=0 '{
  printf("%s%s", i ? (i == $1 ? " " : "\n") : "", $2)
  i = $1
}'

The idea is to join the lines from the output of grep -n :

1:EU-1C0A
1:TM-0401
2:MG-7688
2:DN-0A00
2:DN-0A52
2:MG-3218
3:DN-0A00
3:DN-0A52
4:EU-1C0A
4:MG-7688

by the line numbers. AWK initializes the field separator ( -F: ) and the i variable ( -vi=0 ), then processes the output of the grep command line by line.

It prints a separator chosen by a conditional expression that tests the value of the first field $1 . If i is zero (the first iteration), it prints only the second field $2 . Otherwise, if the first field equals i , it prints a space, else a newline ( "\n" ). After the space/newline the second field is printed.

After printing the next chunk, the value of the first field is stored into i for the next iterations (lines): i = $1 .
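
The joining step can be tried without grep by feeding awk a few hand-written lines in the same number:match format. This is a variant sketch, not the answer's exact program: it uses NR to detect the first line, a variable named prev (an illustrative name) in place of i , and an END block that adds the final newline, which the command above does not print.

```shell
# Hand-written stand-in for the output of grep -n -o
out=$(printf '%s\n' 1:EU-1C0A 1:TM-0401 2:MG-7688 2:DN-0A00 3:DN-0A00 | awk -F: '
{
    # Nothing before the first match, a space within the same line number,
    # a newline when the line number changes
    printf "%s%s", (NR == 1 ? "" : ($1 == prev ? " " : "\n")), $2
    prev = $1
}
END { if (NR) printf "\n" }   # terminate the last output line
')
printf '%s\n' "$out"
```

Matches that share a line number end up on one output line, so the five input lines collapse into three.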

Perl

Parsing grep -n in Perl

use strict;
use warnings;

my $p = 0;

while (<>) {
  /^(\d+):(.*)$/;
  print $p == $1 ? " " : "\n" if $p;
  print $2;
  $p = $1;
}

Usage: grep -n -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' file | perl script.pl

Single Line

But Perl is actually so flexible and powerful that you can solve the problem completely with a single line:

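perl -lne 'print "@_" if @_ = /([A-Z]{2}-[A-Z\d]{4})/g' < file

Interpolating the array in double quotes joins the matches with spaces. A bare print @_ would run them together, because the output field separator $, is empty by default.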

I've seen a similar solution in one of the answers here. Still I decided to post it as it is more compact.

One of the key ideas is using the -l switch that

  1. automatically chomps the input record separator $/ ;
  2. assigns the output record separator $\ the value of $/ (which is newline by default)

The value of output record separator, if defined, is printed after the last argument passed to print . As a result, the script prints all matches ( @_ , in particular ) followed by a newline.

The @_ variable is usually used as an array of subroutine parameters. I have used it in the script only for the sake of shortness.

In Gnu awk. Supports multiple matches on each record:

$ awk '
{
    while(match($0, /[A-Z]{2}-[A-Z0-9]{4}/)) {  # find first match on record
        b=b substr($0,RSTART,RLENGTH) OFS       # buffer the match
        $0=substr($0,RSTART+RLENGTH)            # truncate from start of record
    }
    if(b!="") print b                           # print buffer if not empty
    b=""                                        # empty buffer
}' file
EU-1C0A TM-0401 
MG-7688 DN-0A00 DN-0A52 MG-3218 
DN-0A00 DN-0A52 
EU-1C0A MG-7688 

Downside: there will be an extra OFS in the end of each printed record.

If you want to use an awk other than GNU awk, replace the regex match with:

while(match($0, /[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/))
