
Substring to hash key issue?

I have a log file and need to create a hash key for each URL in the record. Each line from the record has been placed into an array and I am looping through the array assigning hash keys.

I need to get from this:

"2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com" 

to this:

"/logschecks/scripts/setup1.php"

I have tried using match, scan, and split, but none of them has gotten me where I need to go.

My method currently looks like:

def pathHistogram (rowsInFile)
  i = 0
  urlHash = Hash.new

  while i <= rowsInFile.length - 1

    urlKey = rowsInFile[i].scan(/<"GET ">/).last.first

    if urlHash.has_key?(urlKey) == true
      #get the number of stars already in there and add one. 
      urlHash[urlKey] = urlHash[urlKey] + '*'
      i = i + 1

    else 

      urlHash[urlKey] = '*'

      i = i + 1

    end
  end
end

I know that just scanning the "GET " won't complete the job but I was trying to baby-step through it. The match and split versions that I tried were fairly epic-fails, but I was likely using them incorrectly and they are long gone.

Running this script gives me an undefined method error on "first", though I have gotten other errors when I vary the way this is handled.

I should also say I am not married to using scan. If another method would work better, I would be more than happy to switch.

Any help would be greatly appreciated.

You state in a comment on the other answer that the pattern is basically "GET ... HTTP", where you are interested in the ... part. That can be extracted very easily:

line = '2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"'

line[/"GET (.*?) HTTP/, 1]
# => "/logschecks/scripts/setup1.php"
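
To tie this back to the histogram in the question, here is a minimal sketch rather than a definitive implementation. It assumes every request of interest is a GET; the name path_histogram, the /index.html path, and the last two sample rows are made up for illustration. The original scan(/<"GET ">/) matches nothing because < and > are treated as literal characters, so scan returns an empty array, last returns nil, and calling first on nil raises the undefined method error.

```ruby
# Hypothetical rewrite of the question's method: one '*' per request for each path.
def path_histogram(rows_in_file)
  url_hash = Hash.new('') # missing keys read as an empty string

  rows_in_file.each do |row|
    url_key = row[/"GET (.*?) HTTP/, 1] # nil when the line has no GET request
    next if url_key.nil?                # skip lines that do not match

    url_hash[url_key] += '*' # builds a new string, so the default is never mutated
  end

  url_hash # return the hash; the original method returned the while loop's nil
end

# Sample rows: the first is the line from the question, the last two are invented.
line = '2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"'
rows = [
  line,
  line,
  'request: "GET /index.html HTTP/1.1", host: "www.example.com"',
  'a line without any request'
]

histogram = path_histogram(rows)
histogram.each { |path, stars| puts "#{path} #{stars}" }
# /logschecks/scripts/setup1.php **
# /index.html *
```

Iterating with each instead of a manual counter also removes the duplicated i = i + 1 in both branches.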

Assuming each of your input lines contains /logschecks/... :

x = "2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: \"GET /logschecks/scripts/setup1.php HTTP/1.1\", host: \"www.example.com\""


x[%r(/logscheck[/\w\.]+)] # => "/logschecks/scripts/setup1.php"
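
Two things are worth keeping in mind with this prefix-based pattern, shown here as a small sketch. String#[] returns nil when a line has no /logscheck... path, and the character class stops at the first character that is not a slash, word character, or dot, so a query string is cut off. Both sample strings below are invented for illustration.

```ruby
pattern = %r(/logscheck[/\w\.]+)

# Hypothetical inputs, not taken from the original log.
with_query = 'request: "GET /logschecks/scripts/setup1.php?id=1 HTTP/1.1"'
other_path = 'request: "GET /index.html HTTP/1.1"'

matched = with_query[pattern] # => "/logschecks/scripts/setup1.php"
missing = other_path[pattern] # => nil
```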

Scanning HTTP logs isn't hard, but how you go about it will vary depending on the format. In the sample you've given, it's easier than with a standard log because there are some landmarks you can look for:

  • Search for request: " using something like:

     /request: "\S+ (\S+)/i 

    That pattern will skip over GET, POST, HEAD, or whatever method was used for the request.

     log_line[/request: "\S+ (\S+)/i, 1] # => "/logschecks/scripts/setup1.php" 

    You might want to know which method was used if you're mining your logs. In that case...

  • Search for request: "[GET|POST|HEAD|...] using something like:

     /request: "(\S+) (\S+)/i 

    You'd use it like:

     method, url = log_line.match(/request: "(\S+) (\S+)/i).captures # => ["GET", "/logschecks/scripts/setup1.php"]
     method # => "GET"
     url # => "/logschecks/scripts/setup1.php"
  • You can also grab whatever is inside the double quotes, then split it to get at the parts:

     /request: "([^"]+)"/i 

    For instance:

     log_line = %[2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"]

     method, url, http_ver = log_line[/request: "([^"]+)"/i, 1].split # => ["GET", "/logschecks/scripts/setup1.php", "HTTP/1.1"]
     method # => "GET"
     url # => "/logschecks/scripts/setup1.php"
     http_ver # => "HTTP/1.1"
  • Or use a slightly more complex pattern with named captures, which assigns the local variables directly and reduces the code:

     log_line = %[2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"]

     /request: "(?<method>\S+) (?<url>\S+) (?<http_ver>\S+)"/i =~ log_line

     method # => "GET"
     url # => "/logschecks/scripts/setup1.php"
     http_ver # => "HTTP/1.1"
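
Applied to a whole file, a named-capture pattern might be used as in the sketch below. It assumes the lines are already in an array, as in the question; REQUEST_PATTERN and count_requests are hypothetical names, and the POST line is invented for illustration. Because the pattern is stored in a constant rather than written as a literal on the left of =~, the captures are read from the MatchData object instead of from local variables.

```ruby
# Hypothetical constant holding a named-capture pattern for the request field.
REQUEST_PATTERN = /request: "(?<method>\S+) (?<url>\S+) (?<http_ver>\S+)"/i

# Hypothetical helper: counts requests per [method, url] pair.
def count_requests(lines)
  counts = Hash.new(0) # missing keys start at zero

  lines.each do |line|
    m = REQUEST_PATTERN.match(line)
    next unless m # skip lines without a request field

    counts[[m[:method], m[:url]]] += 1
  end

  counts
end

# The first line is the sample from the question; the POST line is invented.
get_line = '2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"'
post_line = 'request: "POST /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"'

counts = count_requests([get_line, post_line, get_line])
counts[['GET', '/logschecks/scripts/setup1.php']]  # => 2
counts[['POST', '/logschecks/scripts/setup1.php']] # => 1
```

Keying on the method and URL together keeps GET and POST hits to the same path separate; keying on m[:url] alone would merge them.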
