
Parsing a large text file results in "Killed" error

I'm trying to parse a very large file (over 4 GB) containing WHOIS information.

I only need a subset of the information contained in the file.

The goal is to output, in JSON format, some WHOIS fields of interest; a rough example of the output I'm after follows the sample below.

#
# The contents of this file are subject to
# RIPE Database Terms and Conditions
#
# http://www.ripe.net/db/support/db-terms-conditions.pdf
#

inetnum:        10.16.151.184 - 10.16.151.191
netname:        NETECONOMY-MG41731 ENTRY 1
descr:          DUMMY FOO ENTRY 1
country:        IT ENTRY 1
admin-c:        DUMY-RIPE
tech-c:         DUMY-RIPE
status:         ASSIGNED PA
notify:         neteconomy.rete@example.com
mnt-by:         INTERB-MNT
changed:        unread@xxx..net 20000101
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

% Tags relating to '80.16.151.184 - 80.16.151.191'
% RIPE-USER-RESOURCE

inetnum:        20.16.151.180 - 20.16.151.183
netname:        NETECONOMY-MG41731 ENTRY 2
descr:          DUMMY FOO ENTRY 2
country:        IT ENTRY 2
admin-c:        DUMY-RIPE
tech-c:         DUMY-RIPE
status:         ASSIGNED PA
notify:         neteconomy.rete@xxx.it
mnt-by:         INTERB-MNT
changed:        unread@xxx.net 20000101
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************
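
For the two sample objects above, the output I'm after would look roughly like this (the exact shape is flexible, as long as the selected fields end up in JSON):

[
    {
        "inetnum": "10.16.151.184 - 10.16.151.191",
        "netname": "NETECONOMY-MG41731 ENTRY 1",
        "descr": "DUMMY FOO ENTRY 1",
        "country": "IT ENTRY 1"
    },
    {
        "inetnum": "20.16.151.180 - 20.16.151.183",
        "netname": "NETECONOMY-MG41731 ENTRY 2",
        "descr": "DUMMY FOO ENTRY 2",
        "country": "IT ENTRY 2"
    }
]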

I'm doing the parsing and the information retrieval using the code below, which I'm sure is far from optimized; I could probably achieve the same result in a more efficient way.

import json
import re

RIPE_DB = "ripe.db.inetnum"  # path to the RIPE database dump being parsed

def create_json2():
    # one regex per field of interest, each with a named capture group
    regex_inetnum = r'inetnum:\s+(?P<inetnum_val>.*)'
    regex_netname = r'netname:\s+(?P<netname_val>.*)'
    regex_country = r'country:\s+(?P<country_val>.*)'
    regex_descr = r'descr:\s+(?P<descr_val>.*)'
    inetnum_list = []
    netname_list = []
    country_list = []
    descr_list = []
    records = []
    with open(RIPE_DB, "r") as f:
        # scan the dump line by line, trying each field regex on every line
        for line in f:
            inetnum = re.search(regex_inetnum, line, re.IGNORECASE)
            netname = re.search(regex_netname, line, re.IGNORECASE)
            country = re.search(regex_country, line, re.IGNORECASE)
            descr = re.search(regex_descr, line, re.IGNORECASE)
            if inetnum is not None:
                inetnum_val = inetnum.group("inetnum_val").strip()
                inetnum_list.append(inetnum_val)
            if netname is not None:
                netname_val = netname.group("netname_val").strip()
                netname_list.append(netname_val)
            if country is not None:
                country_val = country.group("country_val").strip()
                country_list.append(country_val)
            if descr is not None:
                descr_val = descr.group("descr_val").strip()
                descr_list.append(descr_val)

        # pair up the collected values positionally and build one record per entry
        for i, n, d, c in zip(inetnum_list, netname_list, descr_list, country_list):
            data = {'inetnum': i, 'netname': n.upper(), 'descr': d.upper(), 'country': c.upper()}
            records.append(data)
    print json.dumps(records, indent=4)

create_json2()

When I start parsing the file, it stops after a while with the following error.

$> ./parse.py
Killed

RAM and CPU usage are quite high while the file is being processed.

The same code works as expected and without error on smaller files.

Do you have any advice on how to parse this 4+ GB file, and how to improve the code's logic and quality?

The magic word is "flush": you need to get that data out of Python as soon as possible (preferably in batches).

#!/usr/bin/env python

import shelve

# shelve gives a persistent, dict-like store backed by a file on disk
db = shelve.open('ipnum.db')

def split_line(line):
    # split an "attribute: value" line on the first colon only,
    # so values that themselves contain ':' stay intact
    line = line.split(':')
    key = line[0]
    value = ':'.join(line[1:]).strip()
    return key, value

def parse_entry(f):
    # consume lines from the open file object until a (near-)blank line,
    # building a dict of attribute -> value (or list of values for repeated keys)
    entry = {}
    for line in f:
        line = line.strip()
        if len(line) < 5:
            # a blank/short line marks the end of the current object
            break

        key, value = split_line(line)
        if key not in entry:
            entry[key] = value
        else:
            # repeated attribute (e.g. 'remarks'): collect all values in a list
            if not isinstance(entry[key], list):
                entry[key] = [entry[key]]
            entry[key].append(value)

    return entry

def parse_file(file_path):
    i = 0
    with open(file_path) as f:
        for line in f:
            # every 'inetnum:' line starts a new object; hand the file object
            # to parse_entry() so it can consume the rest of that object
            if line.startswith('inetnum'):
                inetnum = split_line(line)[1]
                entry = parse_entry(f)
                db[inetnum] = entry

                # flush to disk every 250k entries so nothing piles up in memory
                if i == 250000:
                    print 'done with 250k'
                    db.sync()
                    i = 0

                i += 1

    db.close()

if __name__ == '__main__':
    parse_file('ripe.db.inetnum')

This script saves the whole database into a shelve file called ipnum.db; you can easily change the output target and how often it should be flushed.

The db.sync() call is somewhat for show, since the underlying bsddb store flushes itself automatically with these quantities of data.
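
If the end goal is still the JSON from the question, the shelve can be read back afterwards and only the fields of interest written out. Here is a minimal sketch, assuming ipnum.db was produced by the script above; the output path, the field defaults and the str() fallback for repeated attributes are just illustrative choices. Writing one JSON object per line keeps memory usage flat even for millions of records.

#!/usr/bin/env python

import json
import shelve

def dump_json(db_path='ipnum.db', out_path='records.json'):
    db = shelve.open(db_path)
    with open(out_path, 'w') as out:
        for inetnum in db.keys():
            entry = db[inetnum]
            # pick out only the fields of interest; repeated attributes are
            # stored as lists by parse_entry(), so str() is used as a crude fallback
            record = {
                'inetnum': inetnum,
                'netname': str(entry.get('netname', '')).upper(),
                'descr': str(entry.get('descr', '')).upper(),
                'country': str(entry.get('country', '')).upper(),
            }
            # one JSON object per line, so nothing large accumulates in memory
            out.write(json.dumps(record) + '\n')
    db.close()

if __name__ == '__main__':
    dump_json()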
