
Extended Log File Format Parser in Ruby

Original post: 2010-07-27 01:57:23 · Tags: ruby, logging, file-io, w3c, text-parsing

I'm looking for a Ruby parser for the W3C Extended Log File Format.

http://www.w3.org/TR/WD-logfile.html

Ideally it would generate a multidimensional array based on the fields in the log file. I'm thinking of something similar to how FasterCSV ( http://fastercsv.rubyforge.org/ ) handles CSV files.

Does anyone know if such a library exists? If not, could anyone provide advice on how I would build one?

I am pretty sure I can figure out the string manipulation to convert the text file into an array. I'm mostly concerned about handling massive log files (so I would potentially need to stream the data back to disk or something).

Sincerely, Cameron

1 Answer

Let's start with the obligatory request to see what you have tried.

Scalability is a big issue when dealing with log files because they can get very big. The extended format is smaller than the standard log format, but you still have to be aware of the potential for consuming large quantities of RAM.
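One way to keep memory use bounded is to process the log one line at a time instead of loading it all into an array. The following is a minimal sketch of that idea, not a finished parser: each_entry_line is a hypothetical helper name, and StringIO stands in for a real log file so the example is self-contained.

```ruby
require 'stringio'

# Hypothetical helper: yields each non-directive log line from any IO-like
# object, one at a time, so only the current line is held in memory.
def each_entry_line(io)
  return enum_for(:each_entry_line, io) unless block_given?

  io.each_line do |line|
    line = line.chomp
    # Skip blank lines and directives such as "#Version:" or "#Fields:".
    next if line.empty? || line.start_with?('#')
    yield line
  end
end

# StringIO is an assumption for illustration; with a file on disk you could
# use File.open(path) { |f| each_entry_line(f) { |line| ... } } instead.
sample = StringIO.new(<<~LOG)
  #Version: 1.0
  #Fields: time cs-method cs-uri
  00:34:23 GET /foo/bar.html
  12:21:16 GET /foo/bar.html
LOG

get_count = 0
each_entry_line(sample) { |line| get_count += 1 if line.include?(' GET ') }
puts get_count
```

Because the helper only ever holds one line, whatever you do inside the block (counting, writing a summary back to disk, and so on) decides how much RAM the run needs, rather than the size of the log.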

You can use regular expressions or simple substring extracts. Substring extracts are faster but lack the cool factor.

require 'benchmark'

TIME_REGEX     = /(\d\d:\d\d:\d\d)/
ACTION_REGEX   = /(\w+)/
FILEPATH_REGEX = /(\S+)/

ary = %(#Version: 1.0
#Date: 12-Jan-1996 00:00:00
#Fields: time cs-method cs-uri
00:34:23 GET /foo/bar.html
12:21:16 GET /foo/bar.html
12:45:52 GET /foo/bar.html
12:57:34 GET /foo/bar.html
).split(/\n+/)

n = 50000
Benchmark.bm(6) do |x|
  x.report('regex') do
    n.times do
      ary.each do |l|
        next if l[/^#/]
        l.strip!
        # l[/^ #{ TIME_REGEX } \s #{ ACTION_REGEX } \s #{ FILEPATH_REGEX } $/ix]
        # l =~ /^ #{ TIME_REGEX } \s #{ ACTION_REGEX } \s #{ FILEPATH_REGEX } $/ix
        l =~ /^ #{ TIME_REGEX } \s #{ ACTION_REGEX } \s #{ FILEPATH_REGEX } $/iox
        timestamp, action, filepath = $1, $2, $3
      end
    end
  end

  x.report('substr') do
    n.times do
      ary.each do |l|  
        next if l[/^#/]
        l.strip!
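        timestamp = l[0, 8]      # fixed-width HH:MM:SS
        action    = l[9, 3]      # assumes a three-letter method such as GET
        filepath  = l[13 .. -1]  # URI starts at index 13, keeping the leading "/"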
      end
    end
  end
end

# >>             user     system      total        real
# >> regex   1.220000   0.000000   1.220000 (  1.235210)
# >> substr  0.800000   0.010000   0.810000 (  0.804276)

Try running the different regular expressions (the two commented-out variants and the active one) to see how subtle changes can make a big difference in run time.

In both the regex and substring versions of the benchmark code, you can extract the ary.each do loop as the basis for what you are looking for.
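As a rough illustration of building on that loop, the sketch below reads the field names from the #Fields: directive and turns each entry into a hash keyed by those names. parse_entries is a hypothetical helper, not an existing library, and splitting on whitespace is a simplifying assumption: it would not handle quoted string fields that contain spaces.

```ruby
# Hypothetical helper: turns log lines into hashes keyed by the names listed
# in the most recent "#Fields:" directive. Whitespace splitting is a
# simplification and does not cover quoted fields containing spaces.
def parse_entries(lines)
  return enum_for(:parse_entries, lines) unless block_given?

  fields = []
  lines.each do |raw|
    line = raw.strip
    next if line.empty?

    if line.start_with?('#')
      # Only the Fields directive matters here; other directives are skipped.
      fields = line.sub(/\A#Fields:\s*/, '').split if line.start_with?('#Fields:')
      next
    end

    # Pair each field name with the value in the same column.
    yield fields.zip(line.split).to_h
  end
end

log = [
  '#Version: 1.0',
  '#Date: 12-Jan-1996 00:00:00',
  '#Fields: time cs-method cs-uri',
  '00:34:23 GET /foo/bar.html',
  '12:21:16 GET /foo/bar.html'
]

parse_entries(log) { |entry| puts entry.inspect }
```

Since the helper accepts anything that responds to each, it should also work with an open File, which yields lines one at a time rather than requiring the whole log in memory.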