Request log parser - text parsing

Question

I have to parse a request log that has the following structure:

07/Dec/2017:18:15:58 +0100 [293920] -> GET URL HTTP/1.1
07/Dec/2017:18:15:58 +0100 [293920] <- 200 text/html 5ms
07/Dec/2017:18:15:58 +0100 [293921] -> GET URL HTTP/1.1
07/Dec/2017:18:15:58 +0100 [293921] <- 200 image/png 39ms
07/Dec/2017:18:15:59 +0100 [293922] -> HEAD URL HTTP/1.0
07/Dec/2017:18:15:59 +0100 [293922] <- 401 - 1ms
07/Dec/2017:18:15:59 +0100 [293923] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293923] <- 200 text/html 178ms
07/Dec/2017:18:15:59 +0100 [293924] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293924] <- 200 text/html 11ms
07/Dec/2017:18:15:59 +0100 [293925] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293925] <- 200 text/html 7ms
07/Dec/2017:18:15:59 +0100 [293926] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293926] <- 200 text/html 16ms
07/Dec/2017:18:15:59 +0100 [293927] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293927] <- 200 text/html 8ms

The output should link two lines of this log based on the number between the square brackets. The goal is to process the information from this log file with other data processing software packages, so I want to export the useful information to a CSV file. The structure of the CSV file should be as follows:

startTimestamp,endTimestamp,requestType/responseCode,URL/typ,responsetime

07/Dec/2017:18:15:58,07/Dec/2017:18:15:58,GET,200,URL,text/html,5ms

I have written a Groovy script that does the trick, but it is terribly slow.

I know I can make some improvements, but I would like your ideas. Some of you have probably tackled this problem in the past.

The response does not always directly follow its request. Not every request gets a response (or the response is not logged because of a server restart).

The log files range from 70 MB up to 300 MB. My Groovy script takes a ridiculously long time.

I know there are good and fast solutions in the Unix terminal with awk and sort, but I have no experience with them.

Thanks in advance for your help.

Below are the possible improvements I see, followed by the code I already have:

1) Use a map keyed by the request number, for faster lookup and less parsing.

2) Don't iterate over the whole backlog list for every line.

def logFile = new File("../request.log")
def outputfile = new File(logFile.parent, logFile.name + ".csv")
def backlog = new ArrayList<String>()
StringBuilder output = new StringBuilder()


outputfile.withPrintWriter { writer ->
    logFile.withReader { Reader reader ->
        reader.eachLine { String line ->
            Iterator<String> it = backlog.iterator()
            while (it.hasNext()) {
                String bLine = it.next()
                String[] lineSplit = line.split(" ")
                if (bLine.contains(lineSplit[2])) {
                    String[] bLineSplit = bLine.split(" ")
                    output.append(bLineSplit[0] + "," + lineSplit[0] + "," + bLineSplit[4] + "," + lineSplit[4] + "," + bLineSplit[5] + "," + lineSplit[5] + "," + lineSplit[6] + "\r\n")
                    //writer.println(outputline)
                    it.remove()
                }
            }
            backlog.add(line)
        }
    }
    writer.println(output)
    if (!backlog.isEmpty()) {
    }
    backlog.each { String line ->
        writer.println(line)
    }
}

Answer 1

As a one-liner:

sort -k 3,3 request.log | awk 'BEGIN { print "startTimestamp;endTimestamp;requestType;responseCode;URL;typ;responsetime"; split("", request); split("", response) } $4 == "->" { printLine(); split($0, request); split("", response) } $4 == "<-" { split($0, response) } END { printLine() } function printLine() { if (length(request)) { print request[1] ";" response[1] ";" request[5] ";" response[5] ";" request[6] ";" response[6] ";" response[7] } }'

As a multi-liner:

sort -k 3,3 request.log | awk '
    BEGIN {
        print "startTimestamp;endTimestamp;requestType;responseCode;URL;typ;responsetime"
        split("", request)
        split("", response)
    }
    $4 == "->" {
        printLine()
        split($0, request)
        split("", response)
    }
    $4 == "<-" {
        split($0, response)
    }
    END {
        printLine()
    }
    function printLine() {
        if (length(request)) {
            print request[1] ";" response[1] ";" request[5] ";" response[5] ";" request[6] ";" response[6] ";" response[7]
        }
    }'
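
The first improvement idea from the question (a map keyed by the request number) also works in awk and avoids the sort. The following is only a sketch under assumptions, not part of the original answer: pair_requests is a hypothetical function name, the field positions are taken from the sample log in the question, and records come out in the order the responses arrive rather than sorted by request number. Requests that never get a response are printed at the end with empty response fields, in unspecified order.

```shell
#!/bin/sh
# Single-pass pairing sketch: no sort, one associative array keyed by request ID.
# pair_requests is a hypothetical helper name; it reads the log on stdin.
pair_requests() {
    awk '
    BEGIN {
        OFS = ";"
        print "startTimestamp", "endTimestamp", "requestType", "responseCode", "URL", "typ", "responsetime"
    }
    # Request line: remember timestamp, method and URL under the bracketed ID ($3).
    # SUBSEP is used as the internal separator because it cannot occur in a URL.
    $4 == "->" {
        pending[$3] = $1 SUBSEP $5 SUBSEP $6
        next
    }
    # Response line: emit the joined record if its request was seen, then free the entry.
    $4 == "<-" && ($3 in pending) {
        split(pending[$3], req, SUBSEP)
        print req[1], $1, req[2], $5, req[3], $6, $7
        delete pending[$3]
    }
    # Requests still pending at the end of the file have no logged response.
    END {
        for (id in pending) {
            split(pending[id], req, SUBSEP)
            print req[1], "", req[2], "", req[3], "", ""
        }
    }'
}
```

It could be run as pair_requests < request.log > request.log.csv. Only requests that are still waiting for a response are held in memory, so the array stays small even for large log files.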

Answer 1
0 Accepted 2017-12-12 11:28:16