
Processing Apache logs quickly

I'm currently running an awk script to process a large (8.1GB) access-log file, and it's taking forever to finish. In 20 minutes, it wrote 14MB of the (1000 ± 500) MB I expect it to write, and I wonder if I can process it much faster somehow.

Here is the awk script:

#!/bin/bash

awk '{t=$4" "$5; gsub(/[\[\]\/]/," ",t); sub(":"," ",t); printf("%s,",$1); system("date -d \""t"\" +%s");}' "$1"

EDIT:

For non-awkers: the script reads each line, extracts the date information, reshapes it into a format the date utility recognizes, and calls date to convert it to the number of seconds since 1970, finally emitting it as a line of a .csv file along with the IP.

Example input: 189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"

Returned output: 189.5.56.113,124237889

@OP, your script is slow mainly because it spawns the external date command via system() for every line of the file, and it's a big file as well (in the GB range). If you have gawk, use its built-in mktime() function to do the date-to-epoch-seconds conversion:

awk 'BEGIN{
   # build a Jan->01, Feb->02, ... lookup table once
   m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
   for(o=1;o<=m;o++){
      date[d[o]]=sprintf("%02d",o)
    }
}
{
    # $4 looks like "[22/Jan/2010:05:54:55"; strip the bracket and
    # turn the ":" separators into "/" so one split() yields all six fields
    gsub(/\[/,"",$4); gsub(":","/",$4); gsub(/\]/,"",$5)
    n=split($4, DATE,"/")
    day=DATE[1]; mth=DATE[2]; year=DATE[3]
    hr=DATE[4];  min=DATE[5]; sec=DATE[6]
    # mktime() expects "YYYY MM DD HH MM SS"
    MKTIME=mktime(year" "date[mth]" "day" "hr" "min" "sec)
    print $1, MKTIME
}' file

Output:

$ more file
189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
$ ./shell.sh    
189.5.56.113 1264110895

If you really, really need it to be faster, you can do what I did. I rewrote an Apache log file analyzer using Ragel. Ragel allows you to mix regular expressions with C code. The regular expressions get transformed into very efficient C code and then compiled. Unfortunately, this requires that you are very comfortable writing code in C. I no longer have this analyzer. It processed 1GB of Apache access logs in 1 or 2 seconds.

You may have limited success removing unnecessary printfs from your awk statement and replacing them with something simpler.

If you are using gawk, you can massage your date and time into a format that mktime() (a gawk function) understands. It will give you the same timestamp you're using now and save you the overhead of repeated system() calls.

This little Python script handles ~400MB worth of copies of your example line in about 3 minutes on my machine, producing ~200MB of output (keep in mind your sample line was quite short, so that's a handicap):

import time

src = open('x.log', 'r')
dest = open('x.csv', 'w')

for line in src:
    # the IP is everything up to the first space
    ip = line[:line.index(' ')]
    # grab what's between '[' and ']', minus the trailing " +0100" offset (6 chars)
    date = line[line.index('[') + 1:line.index(']') - 6]
    t = time.mktime(time.strptime(date, '%d/%b/%Y:%X'))
    dest.write(ip)
    dest.write(',')
    dest.write(str(int(t)))
    dest.write('\n')

src.close()
dest.close()

A minor problem is that it doesn't handle timezones (a strptime() limitation), but you could either hardcode that or add a little extra to take care of it, as sketched below.
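As an illustration, here is a minimal sketch of that "little extra" (not part of the original answer; to_epoch is a hypothetical helper): it parses the "+0100"-style offset itself and feeds the rest to calendar.timegm(), which treats the parsed time as UTC, so the result no longer depends on the machine's local timezone:

import calendar
import time

def to_epoch(stamp):
    """Convert e.g. '22/Jan/2010:05:54:55 +0100' to epoch seconds."""
    datepart, offset = stamp.split(' ')
    # timegm() interprets the struct_time as UTC (no local-TZ surprises)
    t = calendar.timegm(time.strptime(datepart, '%d/%b/%Y:%H:%M:%S'))
    sign = -1 if offset[0] == '-' else 1
    hours, minutes = int(offset[1:3]), int(offset[3:5])
    # a wall-clock time at UTC+hh:mm is hh:mm ahead of UTC, so subtract
    return t - sign * (hours * 3600 + minutes * 60)

print(to_epoch('22/Jan/2010:05:54:55 +0100'))  # 1264136095

Applied per line, this would replace the time.mktime(...) call in the loop above (and the slice would then need to keep the " +0100" part instead of cutting it off).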

But to be honest, something as simple as that should be just as easy to rewrite in C.

gawk '{
    # note: this still forks one date process per line, so it shares
    # the original script's bottleneck
    dt=substr($4,2); gsub(/\//," ",dt); sub(/:/," ",dt)
    cmd="date -d \""dt"\" +%s"
    cmd|getline ts
    close(cmd)    # without close(), gawk eventually runs out of file descriptors
    print $1, ts
}' yourfile
