In our Tomcat access log file, in addition to the default pattern (i.e. common), each line has 3 additional fields (request headers), each enclosed in < and > characters.
Pattern: ... <AJsonString> <User-Agent> <ReferrerURL>
Sample log:
<{'id':'uuid'}> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> \<https://someurl\>
<-> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> <->
Requirement:
I need to extract the strings between the < and > characters, e.g. subStr1 = AJsonString, subStr2 = User-Agent and subStr3 = ReferrerURL. How can I achieve this in bash on Ubuntu?
From each access log line I'm able to extract the sample data above using grep -o '<.*>'
What should the next step be? I'm using "GNU bash, version 4.3.48(1)-release (x86_64-pc-linux-gnu)"
Or is there a simpler/better way to do the whole process?
I'm new to scripting, so any suggestions or pointers would be helpful.
Thanks for your time:)
Assuming this is your line (and that it does not contain pipes, |):
LINE="<{'id':'uuid'}> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> <https://someurl>"
you could use this:
IFS='|' read -r json ua url <<< "$(echo "$LINE" | perl -ne 'm{<([^>]+)>\s*<([^>]+)>\s*<([^>]+)>}; print "$1|$2|$3"')"
Now the variables json, ua and url will hold the data:
$ echo $json
{'id':'uuid'}
$ echo $ua
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36
$ echo $url
https://someurl
How it works:
echo "$LINE" | perl -ne 'm{<([^>]+)>\s*<([^>]+)>\s*<([^>]+)>}; print "$1|$2|$3"'
This runs perl with the -n option, which wraps the script in an implicit loop over the input lines (here, standard input); each line is assigned to the topic/default variable $_. The regex is then matched against that variable:
<([^>]+)>\s*
< # a literal '<'
( # start of capturing group
[^>]+ # one or more characters other than '>'
) # end of capturing group (this one, the second, is saved to $2)
> # a literal '>'
\s* # any whitespace-like character 0 or more times
<([^>]+)>
The data is captured in the variables $1, $2 and $3, which are then printed by the same perl script separated by pipes ( print "$1|$2|$3" ).
This output is fed ( <<<$(somecommand) ) to the read command, which assigns the data to the variables. Before that, we change the field separator variable to a pipe ( IFS='|' ) because its default value is whitespace (space, tab and newline).
NOTE1:
If your line can contain pipes, you should change both IFS and the perl script to use another character.
NOTE2:
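Your first example line had some < and > characters escaped with backslashes; the second did not. In my answer I'm assuming those backslashes do not exist. If they can occur, the regex needs to change: replacing \s* with [\s\\]* should handle a backslash before an opening <. Note that a backslash directly before a closing > is still matched by [^>]+, so it may end up at the end of the captured value.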
NOTE3:
Here's another alternative using sed instead of perl:
LINE="<{'id':'uuid'}> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> <https://someurl>"
IFS='|' read -r json ua url <<< "$(echo "$LINE" | sed -E -e 's/>[ \\]*</|/g' -e 's/^[ \\<]+|[ \\>]+$//g')"
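Since the question asks about bash itself, here is one more sketch that uses neither perl nor sed: bash's own [[ ... =~ ... ]] operator with the BASH_REMATCH array. This is a different technique from the two above, and because no separator character is involved, pipes inside the fields are not a problem. The user agent in the sample is shortened.

```shell
#!/usr/bin/env bash
LINE="<{'id':'uuid'}> <Mozilla/5.0 (Windows NT 6.1; Win64; x64)> <https://someurl>"

# Keeping the pattern in a variable avoids quoting problems inside [[ ... ]].
re='<([^>]+)>[[:space:]]*<([^>]+)>[[:space:]]*<([^>]+)>'

if [[ $LINE =~ $re ]]; then
    # BASH_REMATCH[0] is the whole match; 1..3 are the capture groups.
    json=${BASH_REMATCH[1]}
    ua=${BASH_REMATCH[2]}
    url=${BASH_REMATCH[3]}
    printf '%s\n' "$json" "$ua" "$url"
else
    echo "no match" >&2
fi
```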