
Extract all the occurrences of strings enclosed between < & > in ubuntu bash

In our Tomcat access log file, in addition to the default patternLayout (i.e. common), each line has 3 additional fields (request headers), each enclosed within < and > characters.

Pattern: ... <AJsonString> <User-Agent> <ReferrerURL>

Sample log:

  1. <{'id':'uuid'}> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> \<https://someurl\>
  2. <-> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> <->


I need to extract the strings between the < and > characters, e.g. subStr1 = AJsonString, subStr2 = User-Agent and subStr3 = ReferrerURL. How can I achieve this in Ubuntu bash?

From each access log line I can already extract the sample data above using grep -o '<.*>'. What should the next step be? I'm using "GNU bash, version 4.3.48(1)-release (x86_64-pc-linux-gnu)".

Or is there a simpler/better way to do the whole process?

I'm new to scripting & any suggestions, pointers would be helpful.

Thanks for your time:)

Assuming this is your line (and that it contains no pipe characters | ):

LINE="<{'id':'uuid'}> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> <https://someurl>" 

you could use this:

IFS='|' read -r json ua url <<<"$(echo "$LINE" | perl -ne 'm{<([^>]+)>\s*<([^>]+)>\s*<([^>]+)>}; print "$1|$2|$3"')"

Now the variables json, ua and url will hold the data:

$ echo $json
{'id':'uuid'}
$ echo $ua
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36
$ echo $url
https://someurl
How it works:

echo "$LINE" | perl -ne 'm{<([^>]+)>\s*<([^>]+)>\s*<([^>]+)>}; print "$1|$2|$3"'

This runs perl with the -n option, which wraps the code in an implicit loop over the input lines; each line is assigned to the default (topic) variable $_ . The regex is then matched against that variable:

<               # a literal '<'
   (            # start of capturing group
     [^>]+      # a character that cannot be '>' one or more times
   )            # end of capturing group (the first one is saved to $1, the next two to $2 and $3)
>               # a literal '>'
\s*             # any whitespace-like character 0 or more times

The data is captured in the variables $1, $2 and $3, which are then printed by the same perl script separated by pipes ( print "$1|$2|$3" ).

This output is fed to the read command through a here-string ( <<<$(somecommand) ), and read assigns the fields to the variables. IFS='|' is set for that read only, so the fields are split on pipes instead of the default whitespace (space, tab, newline).


If your line can contain pipes, you should change both IFS and the perl script to use another character.


Your first example line has some < and > escaped with backslashes; the second one does not. My answer assumes those backslashes do not exist. If they can occur, the regex has to change: replacing \s* with [\s\\]* should work, although the backslash before the final > would then end up inside the third capture.


Here's another alternative using sed instead of perl:

LINE="<{'id':'uuid'}> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> <https://someurl>"

IFS='|' read -r json ua url <<<"$(echo "$LINE" | sed -E "s/>[ \\]*</|/g" | sed -E "s/^[ \\<]+|[ \\>]+$//g")"
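If you would rather not call an external tool at all, bash's own [[ ... =~ ... ]] operator can probably do the job as well. This is a different technique from the perl and sed versions above, sketched under the same assumption that the line has exactly three <...> fields with no backslashes. It uses POSIX character classes because bash regexes are ERE and do not portably support \s.

```shell
#!/usr/bin/env bash
line="<{'id':'uuid'}> <Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36> <https://someurl>"

# Keep the regex in a variable so bash does not treat its characters as shell syntax.
re='<([^>]+)>[[:space:]]*<([^>]+)>[[:space:]]*<([^>]+)>'

if [[ $line =~ $re ]]; then
    # BASH_REMATCH[1..3] hold the three capture groups.
    json=${BASH_REMATCH[1]}
    ua=${BASH_REMATCH[2]}
    url=${BASH_REMATCH[3]}
    printf '%s\n' "$json" "$ua" "$url"
else
    echo "no match" >&2
fi
```

Because no separator character is involved, pipes inside the fields are not a problem with this approach.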
