
How to read a 6 GB logfile in Python without loading the entire file into memory first?

I want to read a big log file (6 GB) in buffered chunks: read 100 MB, then sleep for a few seconds, and so on. I want to avoid loading the file's content into memory, reading it instead the way head -nx does in bash. The file consists of blocks; each block contains many lines, and between blocks there are 3 blank lines. For example:

[18/05/2015:00:00:00 +0300]%PARSER_ERROR[elapsedTime]
GET /mobile/ HTTP/1.1
host: www.my-host.com:8082
accept: */*
accept-language: en-gb
connection: keep-alive
accept-encoding: gzip, deflate
user-agent: Mozilla/5.0 (iPhone; CPU iPhone OS 8_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12D508
x-sub-imsi: 418876678
x-sub-msisdn: 333123654



[18/05/2015:00:00:00 +0300]%PARSER_ERROR[elapsedTime]
GET / HTTP/1.1
content-type: application/x-www-form-urlencoded
user-agent: Dalvik/1.6.0 (Linux; U; Android 4.4.2; AirPhoneS6 Build/KOT49H)
host: www.my-host.net
connection: Keep-Alive
accept-encoding: gzip
x-sub-imsi: 418252632
x-sub-msisdn: 333367627836



HTTP/1.1 302 Found
Location: http://www.my-host.net/welcome/main.html
Set-Cookie: oam.Flash.RENDERMAP.TOKEN=-jdrkoipfe; Path=/



[18/05/2015:00:00:00 +0300]%PARSER_ERROR[elapsedTime]
GET / HTTP/1.1
content-type: application/x-www-form-urlencoded
user-agent: Dalvik/1.6.0 (Linux; U; Android 4.4.2; AirPhoneS6 Build/KOT49H)
host: www.my-host.net
connection: Keep-Alive
accept-encoding: gzip
x-sub-imsi: 41887237832
x-sub-msisdn: 333878778

I want to export the user-agent, its msisdn, and the platform version to CSV files, so I am going to generate 2 files, ios.csv and android.csv, and each file will contain unique msisdns. Each row will look like: user-agent, version, msisdn. For example: Android, 4.4.2, 333878778

So I have to go block by block, check the user-agent line, and then grab its msisdn. I tried to do it in bash, but since bash is not flexible enough, I decided to do it in Python.

You can use the fileinput library, which provides an iterator, so it shouldn't load the whole file into memory unless you make it do that.

import fileinput
import time

lines = fileinput.input('my_log_file.txt')  # lazy, line-by-line iterator

for line in lines:
    # do your computation on each line here
    time.sleep(5)  # throttle so the script doesn't hog I/O

fileinput.close()
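
To handle the block structure from the question, you can group consecutive non-blank lines into blocks and then pull out the fields. Here is a rough sketch (untested; the version regexes and the helper names are my own guesses based on the sample user-agent strings):

import csv
import re

def blocks(lines):
    # group consecutive non-blank lines; blank lines separate blocks
    block = []
    for line in lines:
        if line.strip():
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block

def extract(block):
    # pull user-agent and msisdn out of one block; return None if either is missing
    ua = msisdn = None
    for line in block:
        low = line.lower()
        if low.startswith('user-agent:'):
            ua = line.split(':', 1)[1].strip()
        elif low.startswith('x-sub-msisdn:'):
            msisdn = line.split(':', 1)[1].strip()
    if not ua or not msisdn:
        return None
    m = re.search(r'iPhone OS (\d+(?:_\d+)*)', ua)
    if m:
        return ('iOS', m.group(1).replace('_', '.'), msisdn)
    m = re.search(r'Android (\d+(?:\.\d+)*)', ua)
    if m:
        return ('Android', m.group(1), msisdn)
    return None

seen = set()  # remember msisdns so each one is written only once
with open('my_log_file.txt') as log, \
     open('ios.csv', 'w', newline='') as ios_f, \
     open('android.csv', 'w', newline='') as android_f:
    writers = {'iOS': csv.writer(ios_f), 'Android': csv.writer(android_f)}
    for block in blocks(log):
        row = extract(block)
        if row and row[2] not in seen:
            seen.add(row[2])
            writers[row[0]].writerow(row)

Blocks without a user-agent or msisdn (such as the bare HTTP/1.1 302 response in the sample) are simply skipped.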
Alternatively, you can read the file in fixed-size chunks. Note that a 100 MB read can cut a log block in half at the chunk boundary, so doSomeThing would need to carry the unfinished tail of one chunk over to the next:

import time

def readFile(inputFile):
    buff = 100 * 1024 * 1024  # 100 megabytes per read
    with open(inputFile, 'rb') as file_object:
        while True:
            block = file_object.read(buff)
            if not block:  # empty read means end of file
                break
            doSomeThing(block)
            time.sleep(3)  # pause between chunks, as requested
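
For a quick test you could plug in a stub doSomeThing (purely illustrative) that just counts bytes, to verify that the whole file gets consumed:

total = 0

def doSomeThing(block):
    global total
    total += len(block)  # stand-in for real parsing: just count the bytes

readFile('my_log_file.txt')
print('processed %d bytes' % total)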


# time python readfile.py
