
Correct regular expression for the input log

The input log looks like this; its fields are separated by "|". Each record contains id | type | request | response.

110000|read|<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:web="http://webservices.lookup.sdp.bharti.ibm.com">
</soapenv:Envelope>|<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<ns:getLookUpServiceDetailsResponse xmlns:ns="http://webservices.lookup.sdp.bharti.ibm.com">
210000|read|<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:web="http://webservices.lookup.sdp.bharti.ibm.com">
</soapenv:Envelope>|<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<ns:getLookUpServiceDetailsResponse xmlns:ns="http://webservices.lookup.sdp.bharti.ibm.com">
340000|read|<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:web="http://webservices.lookup.sdp.bharti.ibm.com">
</soapenv:Envelope>|<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<ns:getLookUpServiceDetailsResponse xmlns:ns="http://webservices.lookup.sdp.bharti.ibm.com">
450000|read|<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:web="http://webservices.lookup.sdp.bharti.ibm.com">
</soapenv:Envelope>|<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<ns:getLookUpServiceDetailsResponse xmlns:ns="http://webservices.lookup.sdp.bharti.ibm.com">
590000|read|<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:web="http://webservices.lookup.sdp.bharti.ibm.com">
</soapenv:Envelope>|<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<ns:getLookUpServiceDetailsResponse xmlns:ns="http://webservices.lookup.sdp.bharti.ibm.com">

Desired output:

1st log:

id - 110000

type - read

request - <soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:web="http://webservices.lookup.sdp.bharti.ibm.com">

response - <soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<ns:getLookUpServiceDetailsResponse xmlns:ns="http://webservices.lookup.sdp.bharti.ibm.com">

2nd log:

id - 210000

type - read

request -

response - 

The same applies to all n logs in the file.

Configuration file used:

input {
  file {
    path => "/opt/test5/practice_new/final_xml.dat"
    start_position => "beginning"
    codec => multiline {
      pattern => "^%{NUMBER:method_id}\|%{DATA:method_type}\|<soapenv:Envelope>"
      negate => true
      what => previous
    }
  }
}

filter {
  grok {
    match => [ "message", "(?m)^(?<method_id>\d+)\|(?<method_type>\w+)\|(?<request><soapenv:Envelope>.*?</soapenv:Envelope>)\|(?<response><soapenv:Envelope>.*?</soapenv:Envelope>)" ]
  }
}

output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "final"
  }
  stdout {}
}

I tried using this regular expression in Grok, but it does not work for the input logs.

Please help me with the regular expression.

The regex you are currently using is (?m)^(?<method_id>\d+)\|(?<method_type>\w+)\|(?<request><soapenv:Envelope>.*?</soapenv:Envelope>)\|(?<response><soapenv:Envelope>.*?</soapenv:Envelope>). It can only parse the 3rd and 4th columns if each of them starts with a literal <soapenv:Envelope> and ends with </soapenv:Envelope>, with a | between them. Your opening tags carry xmlns attributes, so they never match.

It seems you need a regex that matches the 3rd column as a sequence of any chars other than |, and lets the 4th column grab any number of chars other than | up to a newline that is followed by 1 or more digits and then |.
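To make the failure concrete, here is a minimal Python sketch of the question's pattern. It is an approximation, not the grok engine itself: Python's `re` needs `(?P<name>...)` for named groups, and `re.DOTALL` stands in for the Ruby `(?m)` modifier. The two sample strings are shortened, made-up records (the `example.com` namespace is a placeholder).

```python
import re

# The question's pattern, respelled for Python's re module.
ORIGINAL = re.compile(
    r"^(?P<method_id>\d+)\|(?P<method_type>\w+)\|"
    r"(?P<request><soapenv:Envelope>.*?</soapenv:Envelope>)\|"
    r"(?P<response><soapenv:Envelope>.*?</soapenv:Envelope>)",
    re.DOTALL,  # let . cross line breaks, like Ruby's (?m)
)

# Shaped like the real log: the opening tag has attributes and the
# response has no closing tag.
with_attrs = (
    '110000|read|<soapenv:Envelope xmlns:web="http://example.com/a">\n'
    "</soapenv:Envelope>|<soapenv:Envelope>\n"
    "<ns:getLookUpServiceDetailsResponse>"
)

# The only shape the pattern accepts: bare tags, both columns closed.
bare = (
    "110000|read|<soapenv:Envelope>\n"
    "</soapenv:Envelope>|<soapenv:Envelope>\n"
    "</soapenv:Envelope>"
)

print(ORIGINAL.search(with_attrs))        # None
print(ORIGINAL.search(bare) is not None)  # True
```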

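A pattern along these lines, assembled from the parts explained below, should work:

(?m)^(?<method_id>\d+)\|(?<method_type>\w+)\|(?<request>[^|]*)\|(?<response>[^|\n]*(?:\n(?!\d+\|)[^|\n]*)*)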


  • (?m) - the Ruby modifier that makes . match line break chars
  • ^ - start of a line
  • (?<method_id>\d+) - Group "method_id": one or more digits
  • \| - a pipe char
  • (?<method_type>\w+) - Group "method_type": one or more letters, digits or _
  • \| - a pipe
  • (?<request>[^|]*) - Group "request": any 0+ chars other than |
  • \| - a pipe
  • (?<response>[^|\n]*(?:\n(?!\d+\|)[^|\n]*)*) - Group "response":
    • [^|\n]* - any 0+ chars other than | and LF (newlines)
    • (?:\n(?!\d+\|)[^|\n]*)* - 0+ occurrences of:
      • \n - a newline
      • (?!\d+\|) - not followed by 1+ digits + |
      • [^|\n]* - any 0+ chars other than | and LF (newlines)
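The breakdown above can be tried outside Logstash with a short Python sketch. This is a hedged approximation rather than the grok engine: named groups are respelled as `(?P<name>...)`, and `re.MULTILINE` is used so that `^` matches at each line start, which Ruby's `^` always does. The `LOG` sample is shortened and made up (placeholder `example.com` namespaces) and only keeps the id|type|request|response layout.

```python
import re

# Python spelling of the pattern described in the list above.
RECORD = re.compile(
    r"^(?P<method_id>\d+)\|(?P<method_type>\w+)\|(?P<request>[^|]*)\|"
    r"(?P<response>[^|\n]*(?:\n(?!\d+\|)[^|\n]*)*)",
    re.MULTILINE,  # ^ matches at the start of every line
)

# Two hypothetical records, each spanning three physical lines.
LOG = (
    '110000|read|<soapenv:Envelope xmlns:web="http://example.com/a">\n'
    "</soapenv:Envelope>|<soapenv:Envelope>\n"
    '<ns:getLookUpServiceDetailsResponse xmlns:ns="http://example.com/a">\n'
    '210000|read|<soapenv:Envelope xmlns:web="http://example.com/b">\n'
    "</soapenv:Envelope>|<soapenv:Envelope>\n"
    '<ns:getLookUpServiceDetailsResponse xmlns:ns="http://example.com/b">'
)

# One dict per record; the lookahead (?!\d+\|) stops each response
# before the line that starts the next record.
records = [m.groupdict() for m in RECORD.finditer(LOG)]

for rec in records:
    print("id -", rec["method_id"])
    print("type -", rec["method_type"])
    print("request -", rec["request"])
    print("response -", rec["response"])
```

Note that the pattern relies on neither XML column containing a literal | character; if the payloads can contain one, the `[^|]*` parts would need a different delimiter strategy.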
