Extract all the occurrences of strings enclosed between < & > in ubuntu bash

Question

In our Tomcat access log file, in addition to the default patternLayout (i.e. common), each line has 3 additional fields (request headers), each enclosed within < & > characters.

Pattern: ... <AJsonString> <User-Agent> <ReferrerURL>

Sample log:

<{'id':'uuid'}> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> \<https://someurl\>
<-> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> <->

Requirement:

I need to extract the strings between the < & > characters, like subStr1 = AJsonString, subStr2 = User-Agent & subStr3 = ReferrerURL. How can I achieve this in Ubuntu bash?

From each access log line I'm able to extract the above sample data using grep -o '<.*>'. What should the next step be? I'm using "GNU bash, version 4.3.48(1)-release (x86_64-pc-linux-gnu)".

Or are there any alternatives to do the whole process in a simpler/better way?

I'm new to scripting & any suggestions, pointers would be helpful.

Thanks for your time:)

Answer 1

Assuming this is your line (and that it does not contain pipes | ):

LINE="<{'id':'uuid'}> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> <https://someurl>"

you could use this:

IFS='|' read -r json ua url <<<"$(echo "$LINE"|perl -ne 'm{<([^>]+)>\s*<([^>]+)>\s*<([^>]+)>}; print "$1|$2|$3"')"

Now the variables json , ua and url will have the data:

$ echo $json
{'id':'uuid'}
$ echo $ua
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36
$ echo $url
https://someurl

How it works:

echo "$LINE"|perl -ne 'm{<([^>]+)>\s*<([^>]+)>\s*<([^>]+)>}; print "$1|$2|$3"'

This executes perl with the -n option, which wraps the script in an implicit loop over the input lines; each line is assigned to the topic/default variable $_ . The following regex is then matched against that variable:

<([^>]+)>\s*
<               # a literal '<'
   (            # start of capturing group
     [^>]+      # one or more characters other than '>'
   )            # end of capturing group (this particular group will be saved to $2)
>               # a literal '>'
\s*             # any whitespace-like character 0 or more times
<([^>]+)>

The data is captured in the variables $1 , $2 and $3 , which are then printed by the same perl script, separated by pipes ( print "$1|$2|$3" ).

This output is fed ( <<<$(somecommand) ) to the read command, which assigns the data to the variables. For that one command we set the field separator variable to a pipe ( IFS='|' ), because by default read splits on whitespace (space, tab and newline).
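To make the role of IFS concrete, here is a minimal sketch that uses a made-up record (not taken from the log) to compare splitting on pipes with the default whitespace splitting. The variable names are purely illustrative.

```bash
#!/usr/bin/env bash
# Made-up record, used only to illustrate how IFS changes what `read` does.
record="first|second with spaces|third"

# The IFS= prefix applies to this single `read` command only.
IFS='|' read -r a b c <<<"$record"

# Default IFS: fields are split on whitespace instead of pipes.
read -r x y z <<<"$record"

printf 'a=%s\nb=%s\nc=%s\n' "$a" "$b" "$c"
printf 'x=%s\ny=%s\nz=%s\n' "$x" "$y" "$z"
```

With the pipe separator the middle field keeps its spaces; with the default separator the pipes are treated as ordinary characters and the record is cut at the spaces.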

NOTE1:

If your line can contain pipes, you should change both IFS and the perl script to use another character.

NOTE2:

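Your first example line had some < and > escaped with backslashes. The second example did not. In my answer I'm assuming those backslashes do not exist. If they can appear, the regex has to change. Changing \s* to [\s\\]* should take care of a backslash before each < ; a backslash directly before a closing > would still be captured by [^>]+ and may need to be stripped separately.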

NOTE3:

Here's another alternative using sed instead of perl :

LINE="<{'id':'uuid'}> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> <https://someurl>"

IFS='|' read -r json ua url <<<"$(echo "$LINE"|sed -E "s/>[ \\]*</|/g"|sed -E "s/^[ \\<]+|[ \\>]+$//g")"
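For completeness, a third option is to stay inside bash and use its [[ =~ ]] operator, which avoids both perl and sed and needs no separator character at all. This is a different technique from the two above, shown only as a sketch: the array names are illustrative, and the two input lines are the samples from the question without the backslashes.

```bash
#!/usr/bin/env bash
# Pure-bash sketch: match the three <...> fields with [[ =~ ]].
# The regex lives in a variable so bash treats it as an ERE, not a literal string.
re='<([^>]+)>[[:space:]]*<([^>]+)>[[:space:]]*<([^>]+)>'

jsons=() uas=() urls=()

# A real script would read the log file instead of a here-document,
# e.g. `done < access.log` (that file name is hypothetical).
while IFS= read -r line; do
    if [[ $line =~ $re ]]; then
        jsons+=("${BASH_REMATCH[1]}")
        uas+=("${BASH_REMATCH[2]}")
        urls+=("${BASH_REMATCH[3]}")
    fi
done <<'EOF'
<{'id':'uuid'}> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> <https://someurl>
<-> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> <->
EOF

printf 'json=%s url=%s\n' "${jsons[0]}" "${urls[0]}"
printf 'json=%s url=%s\n' "${jsons[1]}" "${urls[1]}"
```

Since the captures go straight from BASH_REMATCH into variables, pipes or other separator characters inside a field are not a concern here.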

solution1
-1 2022-01-07 11:23:09
