简体   繁体   中英

Splitting logfile with Regular Expressions

I got lines like this one on a log file but I am having problem with my regular expressions. 127.0.0.1 192.168.1.1 1050 1050 127.0.0.1 - GET 8080 ?action=edit&studentId=1 - [24/May/2016:19:33:52 +0300] "GET /CRUDProject/StudentController.do?action=edit&studentId=1 HTTP/1.1" 200 /CRUDProject/StudentController.do 264 ABADDD8AFB03ECC4791D76E543290226 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36" "http://127.0.0.1:8080/CRUDProject/StudentController.do"

Here is my code in a Netbeans project :

public class LogRegExp1 {

public static void main(String argv[]) {
    FileReader myFile = null;
    BufferedReader buff = null;

    String logEntryPattern = "^([\\d.]+|[\\d:]+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) ([\\d]+) [a-zA-Z0-9_ ]*(\\S+) [-]?[ ]?\\[([\\w:/] +\\s[+\\-]\\d{4})\\] \\\"(.+?)\\\" (\\d{3}) (\\S+) ([\\d]+) (\\S+) \"(.+?)\\\" \"(.+?)\\\"";  
    System.out.println("Using RE Pattern:");
    System.out.println(logEntryPattern);

    Pattern p = Pattern.compile(logEntryPattern);

    try {
        myFile = new FileReader("e3600_access_log2016-05-24.log");
        buff = new BufferedReader(myFile);

        while (true) {
            String line = buff.readLine();
            if (line == null) {
                break;
            }

            Matcher matcher = p.matcher(line);
            System.out.println("groups: " + matcher.groupCount());
            if (!matcher.matches()) {
                System.err.println(line + matcher.toString());
                return;
            }

            System.out.println("%a Remote IP Address     : " + matcher.group(1));}
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            buff.close();
            myFile.close();
        } catch (IOException e) {
            e.printStackTrace();
        }}}}`

As a result I get this :

Using RE Pattern:
^([\d.]+|[\d:]+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) ([\d]+) [a-zA-Z0-9_ ]*(\S+) [-]?[ ]?\[([\w:/] +\s[+\-]\d{4})\] \"(.+?)\" (\d{3}) (\S+) ([\d]+) (\S+) "(.+?)\" "(.+?)\"
groups: 17
127.0.0.1 192.168.1.66 1050 1050 127.0.0.1 - GET 8080 ?action=edit&studentId=1 - [24/May/2016:19:33:52 +0300] "GET /CRUDProject/StudentController.do?action=edit&studentId=1 HTTP/1.1" 200 /CRUDProject/StudentController.do 264 ABADDD8AFB03ECC4791D76E543290226 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"  "http://127.0.0.1:8080/CRUDProject/StudentController.do"java.util.regex.Matcher[pattern=^([\d.]+|[\d:]+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) ([\d]+) [a-zA-Z0-9_ ]*(\S+) [-]?[ ]?\[([\w:/] +\s[+\-]\d{4})\] \"(.+?)\" (\d{3}) (\S+) ([\d]+) (\S+) "(.+?)\" "(.+?)\" region=0,427 lastmatch=]`

All help is apreciated on how and what I am doing wrong and should fix so I can get the results I should. Thanks

Your pattern does not match the log entries. Use a tool like http://regexr.com/ to debug regexes.

This modified pattern matches your sample input:

^([\d.]+|[\d:]+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) ([\d]+) [a-zA-Z0-9_ ]*(\S+) [-]?[ ]?\[([\w:/]+ [+\-]\d{4})\] \"(.+?)\" (\d{3}) (\S+) ([\d]+) (\S+) "(.+?)\"  "(.+?)\"

That will probably not solve all your problems, it still looks flaky. Test some more and adapt the pattern.

Description

This regular expression will do the following:

  • match all the substrings in your log message
  • place each matched substring in its own capture group

Note: to use this regex in java, you'll need to replace all the \\ with \\\\ . I've also left the expressions that match each substring on their own lines. If you use this expression in this format you'll need to include the Ignore White Space flag, or simply make the expression a single line. Keep in mind this expression does not do an exhaustive validation on the date or ip address substrings.

^
((?:[0-9]{1,3}\.){3}[0-9]{1,3})\s+
((?:[0-9]{1,3}\.){3}[0-9]{1,3})\s+
([0-9]+)\s+
([0-9]+)\s+
((?:[0-9]{1,3}\.){3}[0-9]{1,3})\s+
-\s+
([a-z]+\s[0-9]+)\s+
(\?[^\s]+)\s+
-\s+
\[([0-9]{1,2}\/(?:Jan|feb|Mar|apr|may|Jun|July|Aug|Sep|Oct|Nov|Dec)\/[0-9]{4}(?::[0-9]{2}){3}\s+\+[0-9]{4})\]\s+
"([^"]+)"\s+
([0-9]+)\s+
([^\s]+)\s+
([0-9]+)\s+
([0-9a-f]+)\s+
"([^"]+)"\s+
"([^"]+)"

正则表达式可视化

To see the image better, you can right click the image and select open in new window.

Example

Live Demo

https://regex101.com/r/mX7gG2/1

Sample text

127.0.0.1 192.168.1.1 1050 1050 127.0.0.1 - GET 8080 ?action=edit&studentId=1 - [24/May/2016:19:33:52 +0300] "GET /CRUDProject/StudentController.do?action=edit&studentId=1 HTTP/1.1" 200 /CRUDProject/StudentController.do 264 ABADDD8AFB03ECC4791D76E543290226 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36" " http://127.0.0.1:8080/CRUDProject/StudentController.do "

Sample Matches

[0][0] = 127.0.0.1 192.168.1.1 1050 1050 127.0.0.1 - GET 8080 ?action=edit&studentId=1 - [24/May/2016:19:33:52 +0300] "GET /CRUDProject/StudentController.do?action=edit&studentId=1 HTTP/1.1" 200 /CRUDProject/StudentController.do 264 ABADDD8AFB03ECC4791D76E543290226 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"  "http://127.0.0.1:8080/CRUDProject/StudentController.do"
[0][1] = 127.0.0.1
[0][2] = 192.168.1.1
[0][3] = 1050
[0][4] = 1050
[0][5] = 127.0.0.1
[0][6] = GET 8080
[0][7] = ?action=edit&studentId=1
[0][8] = 24/May/2016:19:33:52 +0300
[0][9] = GET /CRUDProject/StudentController.do?action=edit&studentId=1 HTTP/1.1
[0][10] = 200
[0][11] = /CRUDProject/StudentController.do
[0][12] = 264
[0][13] = ABADDD8AFB03ECC4791D76E543290226
[0][14] = Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
[0][15] = http://127.0.0.1:8080/CRUDProject/StudentController.do

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  ^                        the beginning of a "line"
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (?:                      group, but do not capture (3 times):
----------------------------------------------------------------------
      [0-9]{1,3}               any character of: '0' to '9' (between
                               1 and 3 times (matching the most
                               amount possible))
----------------------------------------------------------------------
      \.                       '.'
----------------------------------------------------------------------
    ){3}                     end of grouping
----------------------------------------------------------------------
    [0-9]{1,3}               any character of: '0' to '9' (between 1
                             and 3 times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    (?:                      group, but do not capture (3 times):
----------------------------------------------------------------------
      [0-9]{1,3}               any character of: '0' to '9' (between
                               1 and 3 times (matching the most
                               amount possible))
----------------------------------------------------------------------
      \.                       '.'
----------------------------------------------------------------------
    ){3}                     end of grouping
----------------------------------------------------------------------
    [0-9]{1,3}               any character of: '0' to '9' (between 1
                             and 3 times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    [0-9]+                   any character of: '0' to '9' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    [0-9]+                   any character of: '0' to '9' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \5:
----------------------------------------------------------------------
    (?:                      group, but do not capture (3 times):
----------------------------------------------------------------------
      [0-9]{1,3}               any character of: '0' to '9' (between
                               1 and 3 times (matching the most
                               amount possible))
----------------------------------------------------------------------
      \.                       '.'
----------------------------------------------------------------------
    ){3}                     end of grouping
----------------------------------------------------------------------
    [0-9]{1,3}               any character of: '0' to '9' (between 1
                             and 3 times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \5
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  -                        '-'
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \6:
----------------------------------------------------------------------
    [a-z]+                   any character of: 'a' to 'z' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    [0-9]+                   any character of: '0' to '9' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \6
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \7:
----------------------------------------------------------------------
    \?                       '?'
----------------------------------------------------------------------
    [^\s]+                   any character except: whitespace (\n,
                             \r, \t, \f, and " ") (1 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of \7
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  -                        '-'
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  \[                       '['
----------------------------------------------------------------------
  (                        group and capture to \8:
----------------------------------------------------------------------
    [0-9]{1,2}               any character of: '0' to '9' (between 1
                             and 2 times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \/                       '/'
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      Jan                      'Jan'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      feb                      'feb'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      Mar                      'Mar'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      apr                      'apr'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      may                      'may'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      Jun                      'Jun'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      July                     'July'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      Aug                      'Aug'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      Sep                      'Sep'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      Oct                      'Oct'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      Nov                      'Nov'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      Dec                      'Dec'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    \/                       '/'
----------------------------------------------------------------------
    [0-9]{4}                 any character of: '0' to '9' (4 times)
----------------------------------------------------------------------
    (?:                      group, but do not capture (3 times):
----------------------------------------------------------------------
      :                        ':'
----------------------------------------------------------------------
      [0-9]{2}                 any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
    ){3}                     end of grouping
----------------------------------------------------------------------
    \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \+                       '+'
----------------------------------------------------------------------
    [0-9]{4}                 any character of: '0' to '9' (4 times)
----------------------------------------------------------------------
  )                        end of \8
----------------------------------------------------------------------
  \]                       ']'
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  (                        group and capture to \9:
----------------------------------------------------------------------
    [^"]+                    any character except: '"' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \9
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \10:
----------------------------------------------------------------------
    [0-9]+                   any character of: '0' to '9' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \10
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \11:
----------------------------------------------------------------------
    [^\s]+                   any character except: whitespace (\n,
                             \r, \t, \f, and " ") (1 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of \11
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \12:
----------------------------------------------------------------------
    [0-9]+                   any character of: '0' to '9' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \12
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \13:
----------------------------------------------------------------------
    [0-9a-f]+                any character of: '0' to '9', 'a' to 'f'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  )                        end of \13
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  (                        group and capture to \14:
----------------------------------------------------------------------
    [^"]+                    any character except: '"' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \14
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  (                        group and capture to \15:
----------------------------------------------------------------------
    [^"]+                    any character except: '"' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \15
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM