Parsing a log file to extract queries

Question

I want to extract certain URLs from a log file, but only those queries that were ranked 1 or 2. The log file contains a column, itemRank, giving the rank. So far I have been able to extract certain URLs by scanning through the text, but I do not know how to restrict the results to URLs that were clicked at rank 1 or 2.

For example, this is what part of the log file looks like:

(columns are ID, date, time, rank, URL)

763570 2006-03-06 14:09:48 2 http://something.com

763570 2006-03-06 14:09:48 3 http://something.com

Here I just want to extract the first query, because it was ranked 2.

This is my code so far:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Named LogScanner so the class does not shadow java.util.Scanner
public class LogScanner {

    public static void main(String[] args) throws FileNotFoundException {

        File testFile = new File("C:/Users/Zyaad/logs.txt");
        Scanner s = new Scanner(testFile);
        int count = 0;

        String pattern = "http://ontology.buffalo.edu";
        while (s.hasNextLine()) {
            String line = s.nextLine();

            // Counts every line containing the URL, regardless of rank
            if (line.contains(pattern)) {
                count++;

                System.out.println(count + ".query: ");
                System.out.println(line);
            }
        }
        System.out.println("url was clicked: " + count + " times");

        s.close();
    }
}

What can I do to print only the first query? I tried a regex like [\\t\\n\\b\\r\\f] [1,2]{1}[\\t\\n\\b\\r\\f], but this didn't work.

Answer 1

A simple (possibly simplistic) approach would be to:

Determine the number(s) (severity?) you're looking for
Determine a starting pattern for your URL

Example

// assume this is the file you're parsing so I don't have to repeat 
// the whole Scanner part here
String theFile = "763570 2006-03-06 14:09:48 2 http://something2.com\r\n" +
        "763570 2006-03-06 14:09:48 3 http://something3.com";
//                           | your starting digit of choice
//                           | | one white space
//                           | | | group 1 start
//                           | | | | partial protocol of the URL
//                           | | | |  | any character following in 1+ instances
//                           | | | |  | | end of group 1
//                           | | | |  | | 
Pattern p = Pattern.compile("2\\s(http.+)");
Matcher m = p.matcher(theFile);
while (m.find()) {
    // back-referencing group 1
    System.out.println(m.group(1));
}

Output

http://something2.com

Note

Parsing log files with regex is generally advised against.

You'd probably be better off in the long term implementing your own parser and itemizing tokens as properties of objects (one per line, I assume), then manipulating those as desired.
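
As a rough illustration of that suggestion, here is a minimal sketch of such a parser. It assumes every line has exactly five whitespace-separated columns (ID, date, time, rank, URL) and that URLs contain no spaces. The names LogEntryParser, LogEntry, parseLine and urlsWithRankAtMost are made up for this example and are not from the question.

```java
import java.util.ArrayList;
import java.util.List;

public class LogEntryParser {

    /** One parsed log line; field names are illustrative. */
    public static final class LogEntry {
        public final String id;
        public final String date;
        public final String time;
        public final int rank;
        public final String url;

        LogEntry(String id, String date, String time, int rank, String url) {
            this.id = id;
            this.date = date;
            this.time = time;
            this.rank = rank;
            this.url = url;
        }
    }

    /** Returns null when the line does not have five columns or the rank is not a number. */
    public static LogEntry parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length != 5) {
            return null;
        }
        try {
            return new LogEntry(tokens[0], tokens[1], tokens[2],
                    Integer.parseInt(tokens[3]), tokens[4]);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Collects the URLs of all entries whose rank is between 1 and maxRank. */
    public static List<String> urlsWithRankAtMost(String log, int maxRank) {
        List<String> urls = new ArrayList<>();
        for (String line : log.split("\\R")) { // \R matches any line break
            LogEntry entry = parseLine(line);
            if (entry != null && entry.rank >= 1 && entry.rank <= maxRank) {
                urls.add(entry.url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String log = "763570 2006-03-06 14:09:48 2 http://something2.com\n"
                + "763570 2006-03-06 14:09:48 3 http://something3.com";
        System.out.println(urlsWithRankAtMost(log, 2)); // prints [http://something2.com]
    }
}
```

Once each line is an object, the rank check is a plain integer comparison, and other filters (by date, by ID) can be added without touching any regex.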

Answer 2

You can create a regex based on the date and time pattern, or you can simply start from the time pattern.

yyyy-MM-dd hh:mm:ss 1|2

Date & Time pattern followed by 1 or 2

\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s[12]\s

Time pattern followed by 1 or 2

\d{2}:\d{2}:\d{2}\s[12]\s

Sample code:

String[] str=new String[] { "763570 2006-03-06 14:09:48 2 http://something.com",
        "763570 2006-03-06 14:09:48 3 http://something.com" };

// [12] matches a rank of 1 or 2; inside a character class, | would match a literal pipe
Pattern p = Pattern
          .compile("\\d{4}-\\d{2}-\\d{2}\\s\\d{2}:\\d{2}:\\d{2}\\s[12]\\s");
for (String s : str) {
    Matcher m = p.matcher(s);
    if (m.find()) {
        System.out.println(s.substring(m.end()));
    }
}
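
To connect this with the line-by-line loop from the question, the time-based pattern could be combined with the URL filter roughly as follows. This is only a sketch: the class name RankedUrlFilter and the method extract are invented here, the URL is captured as a run of non-whitespace characters, and the filter uses startsWith on the captured URL instead of the question's line.contains.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RankedUrlFilter {

    // Time, then a rank of 1 or 2, then the URL captured as group 1
    private static final Pattern RANKED =
            Pattern.compile("\\d{2}:\\d{2}:\\d{2}\\s[12]\\s(\\S+)");

    /** Returns the URLs ranked 1 or 2 that start with urlPrefix. */
    public static List<String> extract(Scanner in, String urlPrefix) {
        List<String> hits = new ArrayList<>();
        while (in.hasNextLine()) {
            Matcher m = RANKED.matcher(in.nextLine());
            if (m.find() && m.group(1).startsWith(urlPrefix)) {
                hits.add(m.group(1));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for the log file; a Scanner over a File works the same way
        String log = "763570 2006-03-06 14:09:48 2 http://ontology.buffalo.edu/a\n"
                + "763570 2006-03-06 14:09:49 3 http://ontology.buffalo.edu/b";
        try (Scanner in = new Scanner(log)) {
            List<String> hits = extract(in, "http://ontology.buffalo.edu");
            System.out.println("url was clicked: " + hits.size() + " times");
        }
    }
}
```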

Answer 3

You can find some useful patterns here. If it's possible to use other tools, I suggest using logstash, an impressive tool for collecting and parsing logs.

Answer 4

You can extract URLs ranked 1 or 2 like this:

/(?<=\s(?:1|2)\s).*$/

It will grab the last part of the line if the URL is preceded by either 1 or 2.
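
In Java, the same lookbehind could be applied along these lines. This is a sketch rather than part of the original answer: the class name LookbehindDemo and the method extract are hypothetical, the surrounding slashes are dropped because Java has no regex literals, and Pattern.MULTILINE is added so that $ matches at the end of every line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindDemo {

    // MULTILINE makes $ match at each line end, not only at the end of the input
    private static final Pattern URL_AFTER_RANK =
            Pattern.compile("(?<=\\s(?:1|2)\\s).*$", Pattern.MULTILINE);

    /** Returns whatever follows a standalone 1 or 2 on each line. */
    public static List<String> extract(String log) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_AFTER_RANK.matcher(log);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String log = "763570 2006-03-06 14:09:48 2 http://something2.com\n"
                + "763570 2006-03-06 14:09:48 3 http://something3.com\n"
                + "763570 2006-03-06 14:09:48 1 http://something1.com";
        System.out.println(extract(log)); // prints [http://something2.com, http://something1.com]
    }
}
```

The lookbehind only checks the three characters before the match, so it relies on no other column being a standalone 1 or 2.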

Answer 5

Try this:

public static void main(String[] args) throws FileNotFoundException {

    int count = 0;
    // create date pattern
    // source:https://github.com/elasticsearch/logstash/blob/master/patterns/grok-patterns
    String yearPattern = "(?>\\d\\d){1,2}";
    String monthNumPattern = "(?:0?[1-9]|1[0-2])";
    String monthDayPattern = "(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])";
    String hourPattern = "(?:2[0123]|[01]?[0-9])";
    String minutePattern = "(?:[0-5][0-9])";
    String secondPattern = "(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)";
    String datePattern = String.format("%s-%s-%s %s:%s:%s", yearPattern,
            monthNumPattern, monthDayPattern, hourPattern, minutePattern,
            secondPattern);

    // create url pattern
    // source: http://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149
    String urlPattern = "(https?://)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([/\\w \\.-]*)*/?";
    Pattern pattern = Pattern.compile("(\\d+) (" + datePattern
            + ") (\\d+) (" + urlPattern + ")");
    String data = "763570 2006-03-06 14:09:48 3 http://something.com\n"
            + "763570 2006-03-06 14:09:48 2 http://something.com\n"
            + "763570 2006-03-06 14:09:48 1 http://something.com";
    ByteArrayInputStream is = new ByteArrayInputStream(data.getBytes());
    java.util.Scanner s = new java.util.Scanner(is);
    while (s.hasNextLine()) {
        String line = s.nextLine();
        Matcher matcher = pattern.matcher(line);
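        // matches() already fills the groups; group(3) is the rank column.
        // (find(3) would reset the matcher and search again from index 3.)
        if (matcher.matches()) {
            int rank = Integer.parseInt(matcher.group(3));
            if (rank == 1 || rank == 2) {
                count++;
            }
        }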
    }
    System.out.println("url was clicked: " + count + " times");

    s.close();

}

This will output "url was clicked: 2 times" for a file containing:

763570 2006-03-06 14:09:48 3 http://something.com
763570 2006-03-06 14:09:48 2 http://something.com
763570 2006-03-06 14:09:48 1 http://something.com

5 answers

Answer 1
1 2014-05-19 21:22:18

Answer 2
0 2014-05-19 21:27:15

Answer 3
0 2014-05-19 21:51:28

Answer 4
0 2014-05-20 00:49:02

Answer 5
0 2014-05-25 01:07:24
