
Parsing a big text file and extracting data without looping more than once - Python

I want to parse a large text log file (around 1M lines). Example below:

2016-11-08 03:49.879 alfa: (157) all is good

2016-11-08 03:49.979 alfa: (157) there is an ERROR here

2016-11-08 03:50.879 gamma: (2) something else is here

2016-11-08 03:51.879 delta: (69) something is going on

What I want to achieve is to look for errors and then return all lines related to that error - alfa in this case. The problem is: when I loop through the file the first time and find an error, I save alfa (157) as a reference, but how do I then return all alfa (157) lines (even those that happened before the error, as in the example) without looping through the 1M lines again? What if there are 50 errors? Is this possible? Is it an O(n²) problem?

I wanted to use Python:

def analyze_log(f):
    for line in f:
        (..)

1M lines is not that big for modern hardware; I would assemble an in-memory database using a dict. Something like:

log_database = {}
for i, line in enumerate(logfile):
    # "2016-11-08 03:49.879 alfa: (157) all is good" splits into
    # date, time, label ("alfa:") and the rest of the message
    date, time, label, message = line.split(None, 3)
    log_database.setdefault(label, []).append({
        "line number": i,
        "date": date,
        "time": time,
        "message": message,
    })
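
Once that dict is built in a single pass, the original question reduces to lookups: find the labels whose messages contain an error, then return every stored entry for that label, including the ones logged before the error. A minimal sketch reusing the log_database structure above (treating the word ERROR as the error marker is an assumption):

for label, entries in log_database.items():
    # does this label have at least one ERROR line?
    if any("ERROR" in entry["message"] for entry in entries):
        for entry in entries:
            print(entry["line number"], label,
                  entry["date"], entry["time"], entry["message"])

That is one O(n) pass to build the database plus a lookup per error, instead of a fresh pass over the file for each error. If the (157) part matters, the key could be the (label, id) pair instead of the label alone.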

I would suggest that you build a pipeline; that way you can perform multiple operations on every line. If you want to get fancier, you could even build it with coroutines and run them asynchronously, as sketched after the example output below.

def has_errors(line):
    return 'alfa' in line and 'ERROR' in line

def do_something(line):
    # add your processing logic
    processed = line
    return processed

errors = []
processed = []

with open('kwyjibo.log') as log_file:
    for line in log_file:
        if has_errors(line):
            errors.append(line)
        processed.append(do_something(line))

# contents of kwyjibo.log
# 2016-11-08 03:49.879 alfa: (157) all is good
# 2016-11-08 03:49.979 alfa: (157) there is an ERROR here
# 2016-11-08 03:50.879 gamma: (2) something else is here
# 2016-11-08 03:51.879 delta: (69) something is going on

# Output
# In [3]: errors
# Out[3]: ['2016-11-08 03:49.979 alfa: (157) there is an ERROR here\n']

# In [4]: processed
# Out[4]:
# ['2016-11-08 03:49.879 alfa: (157) all is good\n',
# '2016-11-08 03:49.979 alfa: (157) there is an ERROR here\n',
# '2016-11-08 03:50.879 gamma: (2) something else is here\n',
# '2016-11-08 03:51.879 delta: (69) something is going on\n']
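
The coroutine idea is only sketched below; the coroutine decorator and the error_collector/processor names are made up for illustration, and the error test mirrors has_errors above:

def coroutine(func):
    # Prime a generator-based coroutine so it is ready to receive .send() calls.
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return start

@coroutine
def error_collector(errors):
    # Keep every line that looks like an alfa ERROR line (same test as has_errors).
    while True:
        line = (yield)
        if 'alfa' in line and 'ERROR' in line:
            errors.append(line)

@coroutine
def processor(processed):
    # Stand-in for whatever per-line processing is needed.
    while True:
        line = (yield)
        processed.append(line)

errors, processed = [], []
stages = [error_collector(errors), processor(processed)]

with open('kwyjibo.log') as log_file:
    for line in log_file:
        for stage in stages:
            stage.send(line)

Running the stages truly in parallel would need asyncio or threads on top of this; the sketch only shows the fan-out shape of the pipeline, where each line is pushed through every stage.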

You could append all lines from error 157, or any other error, under the same dict key:

log_errors = {}
...
if error_key in log_errors:
    log_errors[error_key].append(line_from_log)
else:
    log_errors[error_key] = [line_from_log]

PS. has_key() has been removed in Python 3; the 'in' operator is used above instead. Note also that the first line for a key is stored in a list so that later lines can be appended to it.
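
For what it's worth, collections.defaultdict expresses the same grouping without the membership test. A rough equivalent, where found_errors is a hypothetical iterable of (error_key, line_from_log) pairs:

from collections import defaultdict

# Every line that shares an error key lands in the same list,
# regardless of the order in which the lines were seen.
log_errors = defaultdict(list)
for error_key, line_from_log in found_errors:  # hypothetical source of key/line pairs
    log_errors[error_key].append(line_from_log)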


 