I am new to shell scripting, and it would be great if I could get some help with the question below.
I want to read a text file line by line, and print all matched patterns in that line to a line in a new text file.
For example:
$ cat input.txt
SYSTEM ERROR: EU-1C0A Report error -- SYSTEM ERROR: TM-0401 DEFAULT Test error
SYSTEM ERROR: MG-7688 DEFAULT error -- SYSTEM ERROR: DN-0A00 Error while getting object -- ERROR: DN-0A52 DEFAULT Error -- ERROR: MG-3218 error occured in HSSL
SYSTEM ERROR: DN-0A00 Error while getting object -- ERROR: DN-0A52 DEFAULT Error
SYSTEM ERROR: EU-1C0A error Failed to fill in test report -- ERROR: MG-7688
The intended output is as follows:
$ cat output.txt
EU-1C0A TM-0401
MG-7688 DN-0A00 DN-0A52 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688
I tried the following code:
while read p; do
grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs
done < input.txt > output.txt
which produced this output:
EU-1C0A TM-0401 MG-7688 DN-0A00 DN-0A52 MG-3218 DN-0A00 DN-0A52 EU-1C0A MG-7688 .......
Then I also tried this:
while read p; do
grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs > output.txt
done < input.txt
But that did not help :(
Maybe there is another way, I am open to awk/sed/cut or whatever... :)
Note: There can be any number of error codes (i.e. XX-XXXX, the pattern of interest) in a single line.
% awk 'BEGIN{RS=": "};NR>1{printf "%s%s", $1, ($0~/\n/)?"\n":" "}' input.txt
EU-1C0A TM-0401
MG-7688 DN-0A00 DN-0A52 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688
Explanation in longform:
awk '
BEGIN{ RS=": " } # Set the record separator to colon-space
NR>1 { # Ignore the first record
printf("%s%s", # Print two strings:
$1, # 1. first field of the record (`$1`)
($0~/\n/) ? "\n" : " ")
# Ternary expression, read as `if condition (thing
# between brackets), then thing after `?`, otherwise
# thing after `:`.
# So: If the record ($0) matches (`~`) newline (`\n`),
# then put a newline. Otherwise, put a space.
}
' input.txt
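To see what the record separator does to the input, it can help to print each record's number, its first field, and whether the record contains a newline. This is only a sketch: it assumes an awk that accepts a multi-character RS (gawk and mawk do; POSIX leaves that case unspecified), and the function name show_records and the sample codes are made up for illustration.

```shell
# Hypothetical helper: list the records awk sees when RS is colon-space.
show_records() {
    awk 'BEGIN { RS = ": " }
         { printf "%d %s %s\n", NR, $1, ($0 ~ /\n/) ? "end-of-line" : "mid" }' "$@"
}

# Two sample lines with invented codes; reads stdin because no file is given.
printf '%s\n' \
    'SYSTEM ERROR: AA-0001 first -- ERROR: BB-0002 second' \
    'ERROR: CC-0003 third' | show_records
```

Record 1 is the text before the first colon, which is why the one-liner skips it with NR>1. A record that runs past the end of an input line contains a newline, and that is the cue to finish the output line.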
Previous answer to the unmodified question:
% awk 'BEGIN{RS=": "};NR>1{printf "%s%s", $1, (NR%2==1)?"\n":" "}' input.txt
EU-1C0A TM-0401
MG-7688 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688
edit: With safeguard against ": " injection (thx @e0k). It tests that the first field after the record separator looks the way we expect it to.
awk 'BEGIN{RS=": "};NR>1 && $1 ~ /^[A-Z]{2}-[A-Z0-9]{4}$/ {printf "%s%s", $1, ($0~/\n/)?"\n":" "}' input.txt
There's always perl! And this will grab any number of matches per line.
perl -nle '@matches = /[A-Z]{2}-[A-Z0-9]{4}/g; print(join(" ", @matches)) if (scalar @matches);' input.txt > output.txt
-e gives the perl code to be run, -n runs it on one line at a time, and -l automatically chomps the line and adds a newline to prints.
The regex implicitly matches against $_, so @matches = $_ =~ //g is overly verbose.
If there is no match, this will not print anything.
You could always keep it extremely simple:
$ awk '{o=""; for (i=1;i<=NF;i++) if ($i=="ERROR:") o=o$(i+1)" "; print o}' input.txt
EU-1C0A TM-0401
MG-7688 DN-0A00 DN-0A52 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688
The above will add a blank char to the end of each line, trivially avoided if you care...
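One way to avoid that trailing blank, sketched under the same assumption that every code directly follows the word ERROR:, is to emit the separator before each code except the first. The function name codes_after_error and the variable sep are illustrative, not part of the answer above.

```shell
# Hypothetical helper: same field scan, but with no trailing blank.
codes_after_error() {
    awk '{
        out = ""; sep = ""
        for (i = 1; i < NF; i++)            # stop before NF so $(i+1) always exists
            if ($i == "ERROR:") {
                out = out sep $(i + 1)      # separator goes in front, never behind
                sep = " "
            }
        print out
    }' "$@"
}

printf '%s\n' \
    'SYSTEM ERROR: EU-1C0A error Failed to fill in test report -- ERROR: MG-7688' |
    codes_after_error    # prints: EU-1C0A MG-7688
```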
To keep your grep pattern, here's a way:
while IFS='' read -r p; do
echo $(grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' <<<"$p")
done < input.txt > output.txt
while IFS='' read -r p; do is the standard way to read line by line into a variable. See, e.g., this answer.
grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' <<<"$p" runs your grep and prints the matches. The <<<"$p" is a "here string" that provides the string $p (the line that was read in) as stdin to grep. This means grep will search the contents of $p and print each match on its own line.
echo $(grep ...) converts the newlines in grep's output to spaces, and adds a newline at the end. Since this loop happens for each line, the result is to print each input line's matches on a single line of the output.
done < input.txt > output.txt is correct: you are providing input to, and taking output from, the loop as a whole. You don't need redirection within the loop.
If you know that each line contains exactly two instances of the string you want to match, another solution works:
cat input.txt | grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs -L2 > output.txt
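When the number of codes per line varies, xargs -L2 no longer lines up with the input, and the per-line loop is the safer shape. A minimal sketch of that loop as a reusable function follows. The name codes_per_line is made up, and printf piped into grep stands in for the bash here string so that the snippet should also run under a plain sh.

```shell
# Hypothetical helper: one output line per input line, holding that line's codes.
codes_per_line() {
    while IFS= read -r line; do
        # grep sees only this one line, so its matches belong together.
        # "|| true" keeps a no-match line from aborting a script run with set -e.
        matches=$(printf '%s\n' "$line" | grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' || true)
        # Left unquoted on purpose: the shell splits on the newlines and
        # echo rejoins the pieces with single spaces.
        echo $matches
    done
}

printf '%s\n' \
    'SYSTEM ERROR: EU-1C0A Report error -- SYSTEM ERROR: TM-0401 DEFAULT Test error' \
    'SYSTEM ERROR: DN-0A00 Error while getting object -- ERROR: DN-0A52 DEFAULT Error' |
    codes_per_line
```

As with the echo $(grep ...) form, a line without any code produces an empty output line rather than being skipped.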
Here is a solution with awk that is fairly straightforward, but it is not an elegant one-liner (as many awk solutions tend to be). It should work with any number of your error codes per line, and with an error code defined as a field (white space separated word) that matches a given regex. Since it's not a snazzy one-liner, I stored the program in a file:
codes.awk
#!/usr/bin/awk -f
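{
    m = 0                                  # codes printed so far on this line
    for (i = 1; i <= NF; ++i) {
        if ($i ~ /^[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]$/) {
            if (m > 0) printf "%s", OFS    # separator only between codes
            printf "%s", $i                # explicit format, so a % in the data is harmless
            m++
        }
    }
    if (m > 0) printf "%s", ORS            # end the line only if something was printed
}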
You would run this like
$ awk -f codes.awk input.txt
I hope you find it fairly easy to read. It runs the block once for each line of input. It iterates over each field and checks if it matches a regular expression, then prints the field if it does. The variable m keeps track of the number of matched fields on the current line so far. The purpose of this is to print the output field separator OFS (a space by default) between the matched fields only as needed and to use the output record separator ORS (a newline by default) only if there was at least one error code found. This prevents unnecessary white space.
Notice that I have changed your regular expression from [A-Z]{2}-[A-Z0-9]{4} to [A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]. This is because old awk will not (or at least may not) support interval expressions (the {n} parts). You could use [A-Z]{2}-[A-Z0-9]{4} with gawk, however. You can tweak the regex as needed. (In both awk and gawk, regular expressions are delimited by /.)
The regex /[A-Z]{2}-[A-Z0-9]{4}/ would match any field that contains your XX-XXXX pattern of letters and digits. You want the field to be a full match to the regex and not just include something that matches that pattern. To do this, the ^ and $ mark the beginning and end of the string. For example, /^[A-Z]{2}-[A-Z0-9]{4}$/ (with gawk) would match US-BOTZ, but not USA-ROBOTS. Without the ^ and $, USA-ROBOTS would match because it includes a substring SA-ROBO that does match the regex.
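The difference between the anchored and unanchored forms can be checked directly. The sketch below uses a made-up function, classify, and spells the pattern out without interval expressions so that it does not depend on gawk.

```shell
# Hypothetical helper: say how a single word relates to the XX-XXXX pattern.
classify() {
    printf '%s\n' "$1" | awk '{
        code = "[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]"
        if ($0 ~ ("^" code "$")) print "full"        # whole word is a code
        else if ($0 ~ code)      print "substring"   # a code is buried inside it
        else                     print "none"
    }'
}

classify US-BOTZ       # prints: full
classify USA-ROBOTS    # prints: substring
classify 12-3456       # prints: none
```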
grep -n with AWK
grep -n -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' file | awk -F: -vi=0 '{
printf("%s%s", i ? (i == $1 ? " " : "\n") : "", $2)
i = $1
}'
The idea is to join the lines from the output of grep -n:
1:EU-1C0A
1:TM-0401
2:MG-7688
2:DN-0A00
2:DN-0A52
2:MG-3218
3:DN-0A00
3:DN-0A52
4:EU-1C0A
4:MG-7688
by the line numbers. AWK initializes the field separator (-F:) and the i variable (-vi=0), then processes the output of the grep command line by line.
It prints a character depending on a conditional expression that tests the value of the first field $1. If i is zero (the first iteration), it prints only the second field $2. Otherwise, if the first field equals i, it prints a space, else a newline ("\n"). After the space/newline the second field is printed.
After printing the next chunk, the value of the first field is stored into i for the next iterations (lines): i = $1.
grep -n in Perl
use strict;
use warnings;
my $p = 0;
while (<>) {
/^(\d+):(.*)$/;
print $p == $1 ? " " : "\n" if $p;
print $2;
$p = $1;
}
Usage: grep -n -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' file | perl script.pl
But Perl is actually so flexible and powerful that you can solve the problem completely with a single line:
perl -lne 'print "@_" if @_ = /([A-Z]{2}-[A-Z\d]{4})/g' < file
I've seen a similar solution in one of the answers here. Still I decided to post it as it is more compact.
One of the key ideas is using the -l switch, which chomps the input record separator $/ and assigns the output record separator $\ to have the value of $/ (which is newline by default). The value of the output record separator, if defined, is printed after the last argument passed to print. As a result, the script prints all matches (@_, in particular) followed by a newline.
The @_ variable is usually used as an array of subroutine parameters. I have used it in the script only for the sake of brevity.
With GNU awk, supporting multiple matches on each record:
$ awk '
{
while(match($0, /[A-Z]{2}-[A-Z0-9]{4}/)) { # find first match on record
b=b substr($0,RSTART,RLENGTH) OFS # buffer the match
$0=substr($0,RSTART+RLENGTH) # truncate from start of record
}
if(b!="") print b # print buffer if not empty
b="" # empty buffer
}' file
EU-1C0A TM-0401
MG-7688 DN-0A00 DN-0A52 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688
Downside: there will be an extra OFS at the end of each printed record.
If you want to use awks other than GNU awk, replace the regex match with:
while(match($0, /[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/))
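For reference, a sketch that combines the spelled-out regex with a separator placed in front of each match, which should leave no trailing OFS. It works on a copy of the record instead of modifying $0. The names scan_codes, line, out and sep are illustrative only.

```shell
# Hypothetical helper: match()-based scan without interval expressions.
scan_codes() {
    awk '{
        line = $0; out = ""; sep = ""
        while (match(line, /[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/)) {
            out = out sep substr(line, RSTART, RLENGTH)   # append this match
            sep = OFS                                     # later matches get a separator
            line = substr(line, RSTART + RLENGTH)         # continue after the match
        }
        if (out != "") print out                          # skip lines without codes
    }' "$@"
}

printf '%s\n' \
    'SYSTEM ERROR: MG-7688 DEFAULT error -- SYSTEM ERROR: DN-0A00 Error while getting object' \
    'nothing to see here' \
    'ERROR: EU-1C0A' | scan_codes
```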