
Loading Large File in Python

I'm using Python 2.6.2 [GCC 4.3.3] running on Ubuntu 9.04. I need to read a big datafile (~1GB, >3 million lines), line by line, using a Python script.

I tried the methods below; I find they use a very large amount of memory (~3GB):

for line in open('datafile','r').readlines():
   process(line)

or,

for line in file(datafile):
   process(line)

Is there a better way to load a large file line by line, say

  • a) by explicitly specifying the maximum number of lines to hold in memory at any one time? Or
  • b) by loading it in chunks of, say, 1024 bytes, provided the last line of each chunk is read completely, without being truncated? (Both options are sketched below.)
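
For illustration, here is roughly what I mean by (a) and (b). This is an untested sketch; process() is my per-line handler and the batch/chunk sizes are arbitrary:

import itertools

def read_in_line_batches(path, max_lines=100000):
    # (a) hold at most max_lines lines in memory at any one time
    with open(path) as f:
        while True:
            batch = list(itertools.islice(f, max_lines))
            if not batch:
                break
            for line in batch:
                process(line)

def read_in_byte_chunks(path, chunk_size=1024):
    # (b) read fixed-size chunks, carrying the trailing partial line
    # over to the next chunk so no line is ever truncated
    leftover = ''
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                if leftover:
                    process(leftover)
                break
            lines = (leftover + chunk).split('\n')
            leftover = lines.pop()  # possibly incomplete last line
            for line in lines:
                process(line + '\n')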

Several suggestions I found point to the methods I mentioned above, which I have already tried; I'm trying to see if there is a better way to handle this. My search has not been fruitful so far. I appreciate your help.

P.S. I have done some memory profiling using Heapy and found no memory leaks in the Python code I am using.

Update 20 August 2012, 16:41 (GMT+1)

I tried both approaches as suggested by J.F. Sebastian, mgilson and IamChuckB (datafile is a variable):

with open(datafile) as f:
    for line in f:
        process(line)

Also,

import fileinput
for line in fileinput.input([datafile]):
    process(line)

Strangely, both of them use ~3GB of memory; my datafile in this test is 765.2MB, consisting of 21,181,079 lines. I see the memory usage increase over time (in steps of around 40-80MB) before stabilizing at 3GB.

An elementary question: is it necessary to flush the line after use?

I did memory profiling using Heapy to understand this better.

Level 1 Profiling

Partition of a set of 36043 objects. Total size = 5307704 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  15934  44  1301016  25   1301016  25 str
     1     50   0   628400  12   1929416  36 dict of __main__.NodeStatistics
     2   7584  21   620936  12   2550352  48 tuple
     3    781   2   590776  11   3141128  59 dict (no owner)
     4     90   0   278640   5   3419768  64 dict of module
     5   2132   6   255840   5   3675608  69 types.CodeType
     6   2059   6   247080   5   3922688  74 function
     7   1716   5   245408   5   4168096  79 list
     8    244   1   218512   4   4386608  83 type
     9    224   1   213632   4   4600240  87 dict of type
<104 more rows. Type e.g. '_.more' to view.>

============================================================

Level 2 Profiling for Level 1-Index 0

Partition of a set of 15934 objects. Total size = 1301016 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   2132  13   274232  21    274232  21 '.co_code'
     1   2132  13   189832  15    464064  36 '.co_filename'
     2   2024  13   114120   9    578184  44 '.co_lnotab'
     3    247   2   110672   9    688856  53 "['__doc__']"
     4    347   2    92456   7    781312  60 '.func_doc', '[0]'
     5    448   3    27152   2    808464  62 '[1]'
     6    260   2    15040   1    823504  63 '[2]'
     7    201   1    11696   1    835200  64 '[3]'
     8    188   1    11080   1    846280  65 '[0]'
     9    157   1     8904   1    855184  66 '[4]'
<4717 more rows. Type e.g. '_.more' to view.>

Level 2 Profiling for Level 1-Index 1

Partition of a set of 50 objects. Total size = 628400 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0     50 100   628400 100    628400 100 '.__dict__'

Level 2 Profiling for Level 1-Index 2

Partition of a set of 7584 objects. Total size = 620936 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   1995  26   188160  30    188160  30 '.co_names'
     1   2096  28   171072  28    359232  58 '.co_varnames'
     2   2078  27   157608  25    516840  83 '.co_consts'
     3    261   3    21616   3    538456  87 '.__mro__'
     4    331   4    21488   3    559944  90 '.__bases__'
     5    296   4    20216   3    580160  93 '.func_defaults'
     6     55   1     3952   1    584112  94 '.co_freevars'
     7     47   1     3456   1    587568  95 '.co_cellvars'
     8     35   0     2560   0    590128  95 '[0]'
     9     27   0     1952   0    592080  95 '.keys()[0]'
<189 more rows. Type e.g. '_.more' to view.>

Level 2 Profiling for Level 1-Index 3

Partition of a set of 781 objects. Total size = 590776 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0      1   0    98584  17     98584  17 "['locale_alias']"
     1     29   4    35768   6    134352  23 '[180]'
     2     28   4    34720   6    169072  29 '[90]'
     3     30   4    34512   6    203584  34 '[270]'
     4     27   3    33672   6    237256  40 '[0]'
     5     25   3    26968   5    264224  45 "['data']"
     6      1   0    24856   4    289080  49 "['windows_locale']"
     7     64   8    20224   3    309304  52 "['inters']"
     8     64   8    17920   3    327224  55 "['galog']"
     9     64   8    17920   3    345144  58 "['salog']"
<84 more rows. Type e.g. '_.more' to view.>

============================================================

Level 3 Profiling for Level 2-Index 0, Level 1-Index 0

Partition of a set of 2132 objects. Total size = 274232 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   2132 100   274232 100    274232 100 '.co_code'

Level 3 Profiling for Level 2-Index 0, Level 1-Index 1

Partition of a set of 50 objects. Total size = 628400 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0     50 100   628400 100    628400 100 '.__dict__'

Level 3 Profiling for Level 2-Index 0, Level 1-Index 2

Partition of a set of 1995 objects. Total size = 188160 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   1995 100   188160 100    188160 100 '.co_names'

Level 3 Profiling for Level 2-Index 0, Level 1-Index 3

Partition of a set of 1 object. Total size = 98584 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0      1 100    98584 100     98584 100 "['locale_alias']"

Still troubleshooting this.

Do share with me if you have faced this before.

Thanks for your help.

Update 21 August 2012, 01:55 (GMT+1)

  1. mgilson, the process function is used to post-process a Network Simulator 2 (NS2) trace file. Some of the lines in the trace file are shared below. I am using numerous objects, counters, tuples, and dictionaries in the Python script to learn how a wireless network performs.
 s 1.231932886 _25_ AGT --- 0 exp 10 [0 0 0 0 YY] ------- [25:0 0:0 32 0 0]
 s 1.232087886 _25_ MAC --- 0 ARP 86 [0 ffffffff 67 806 YY] ------- [REQUEST 103/25 0/0]
 r 1.232776108 _42_ MAC --- 0 ARP 28 [0 ffffffff 67 806 YY] ------- [REQUEST 103/25 0/0]
 r 1.232776625 _34_ MAC --- 0 ARP 28 [0 ffffffff 67 806 YY] ------- [REQUEST 103/25 0/0]
 r 1.232776633 _9_ MAC --- 0 ARP 28 [0 ffffffff 67 806 YY] ------- [REQUEST 103/25 0/0]
 r 1.232776658 _0_ MAC --- 0 ARP 28 [0 ffffffff 67 806 YY] ------- [REQUEST 103/25 0/0]
 r 1.232856942 _35_ MAC --- 0 ARP 28 [0 ffffffff 64 806 YY] ------- [REQUEST 100/25 0/0]
 s 1.232871658 _0_ MAC --- 0 ARP 86 [13a 67 1 806 YY] ------- [REPLY 1/0 103/25]
 r 1.233096712 _29_ MAC --- 0 ARP 28 [0 ffffffff 66 806 YY] ------- [REQUEST 102/25 0/0]
 r 1.233097047 _4_ MAC --- 0 ARP 28 [0 ffffffff 66 806 YY] ------- [REQUEST 102/25 0/0]
 r 1.233097050 _26_ MAC --- 0 ARP 28 [0 ffffffff 66 806 YY] ------- [REQUEST 102/25 0/0]
 r 1.233097051 _1_ MAC --- 0 ARP 28 [0 ffffffff 66 806 YY] ------- [REQUEST 102/25 0/0]
 r 1.233109522 _25_ MAC --- 0 ARP 28 [13a 67 1 806 YY] ------- [REPLY 1/0 103/25]
 s 1.233119522 _25_ MAC --- 0 ACK 38 [0 1 67 0 YY]
 r 1.233236204 _17_ MAC --- 0 ARP 28 [0 ffffffff 65 806 YY] ------- [REQUEST 101/25 0/0]
 r 1.233236463 _20_ MAC --- 0 ARP 28 [0 ffffffff 65 806 YY] ------- [REQUEST 101/25 0/0]
 D 1.233236694 _18_ MAC COL 0 ARP 86 [0 ffffffff 65 806 67 1] ------- [REQUEST 101/25 0/0]

  2. The aim of doing 3-level profiling using Heapy is to help me narrow down which object(s) are eating up most of the memory. As you can see, unfortunately I could not tell which one specifically needs tweaking, as the output is too generic. For example, although I know "dict of __main__.NodeStatistics" has only 50 objects out of 36,043 (0.1%), it takes up 12% of the total memory used to run the script, yet I am unable to find which specific dictionary I would need to look into.

  3. I tried implementing David Eyk's suggestion as below (snippet), manually garbage-collecting every 500,000 lines:

 import gc
 for i, line in enumerate(file(datafile)):
     if (i % 500000 == 0):
         print '-----------This is line number', i
         collected = gc.collect()
         print "Garbage collector: collected %d objects." % (collected)

Unfortunately, the memory usage is still at 3GB and the output (snippet) is as below,

-----------This is line number 0
Garbage collector: collected 0 objects.
-----------This is line number 500000
Garbage collector: collected 0 objects.

  4. Having implemented martineau's suggestion, I see the memory usage is now 22MB, down from the earlier 3GB! Something I had been looking forward to achieving. The strange thing is the following:

I did the same memory profiling as before,

Level 1 Profiling

Partition of a set of 35474 objects. Total size = 5273376 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  15889  45  1283640  24   1283640  24 str
     1     50   0   628400  12   1912040  36 dict of __main__.NodeStatistics
     2   7559  21   617496  12   2529536  48 tuple
     3    781   2   589240  11   3118776  59 dict (no owner)
     4     90   0   278640   5   3397416  64 dict of module
     5   2132   6   255840   5   3653256  69 types.CodeType
     6   2059   6   247080   5   3900336  74 function
     7   1716   5   245408   5   4145744  79 list
     8    244   1   218512   4   4364256  83 type
     9    224   1   213632   4   4577888  87 dict of type
<104 more rows. Type e.g. '_.more' to view.>

Comparing the previous memory profiling output with the above: str has 45 fewer objects (17,376 bytes less), tuple has 25 fewer objects (3,440 bytes less), and dict (no owner), though unchanged in object count, is 1,536 bytes smaller. All other objects are the same, including dict of __main__.NodeStatistics. The total number of objects is 35,474. This small reduction in objects (0.2%) produced a 99.3% memory saving (22MB, down from 3GB). Very strange.

As you can see, though I know where the memory starvation is occurring, I am not yet able to narrow down which object is causing the bleed.

Will continue to investigate this.

Thanks for all the pointers; I am using this opportunity to learn a lot about Python, as I am not an expert. I appreciate the time you have taken to assist me.

Update 23 August 2012, 00:01 (GMT+1) -- SOLVED

  1. I continued debugging using the minimalistic code per martineau's suggestion. I began to add code back into the process function and observe the memory bleeding.

  2. I found the memory starts to bleed when I add a class like the one below:

 class PacketStatistics(object):
     def __init__(self):
         self.event_id = 0
         self.event_source = 0
         self.event_dest = 0
         ...

I am using 3 classes with 136 counters.

  3. I discussed this issue with my friend Gustavo Carneiro; he suggested using __slots__ to replace __dict__.

  4. I converted the class as below:

 class PacketStatistics(object):
     __slots__ = ('event_id', 'event_source', 'event_dest', ...)

     def __init__(self):
         self.event_id = 0
         self.event_source = 0
         self.event_dest = 0
         ...

  5. When I converted all 3 classes, the memory usage dropped from the earlier 3GB to 504MB. A whopping 80% memory saving!

  6. Below is the memory profiling after the __dict__ to __slots__ conversion.

 Partition of a set of 36157 objects. Total size = 4758960 bytes.
  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
      0  15966  44  1304424  27   1304424  27 str
      1   7592  21   624776  13   1929200  41 tuple
      2    780   2   587424  12   2516624  53 dict (no owner)
      3     90   0   278640   6   2795264  59 dict of module
      4   2132   6   255840   5   3051104  64 types.CodeType
      5   2059   6   247080   5   3298184  69 function
      6   1715   5   245336   5   3543520  74 list
      7    225   1   232344   5   3775864  79 dict of type
      8    244   1   223952   5   3999816  84 type
      9    166   0   190096   4   4189912  88 dict of class
 <101 more rows. Type e.g. '_.more' to view.>

The dict of __main__.NodeStatistics is not in the top 10 anymore.
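
To illustrate why __slots__ makes such a difference (a toy comparison only, not taken from my actual script; WithDict and WithSlots are made-up names): a normal instance carries its own per-instance __dict__ for its attributes, while a slotted instance stores them in fixed descriptors on the class and has no __dict__ at all.

import sys

class WithDict(object):
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0

class WithSlots(object):
    __slots__ = ('event_id', 'event_source', 'event_dest')
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0

d = WithDict()
s = WithSlots()
print sys.getsizeof(d) + sys.getsizeof(d.__dict__)  # instance plus its attribute dict
print sys.getsizeof(s)                              # slotted instance, no __dict__
print hasattr(s, '__dict__')                        # prints False

With 136 counters per instance, the per-instance dictionaries are comparatively large, which is consistent with the savings I saw.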

I am happy with the result and glad to close this issue.

Thanks for all your guidance. Truly appreciate it.

Regards,
Saravanan K

with open('datafile') as f:
    for line in f:
        process(line)

This works because files are iterators yielding 1 line at a time until there are no more lines to yield.
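
For instance (a tiny, hypothetical demonstration; 'datafile' stands in for your real path):

f = open('datafile')
print iter(f) is f   # True: a file object is its own iterator
print next(f)        # first line
print next(f)        # second line, and so on until StopIteration
f.close()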

The fileinput module will let you read it line by line without loading the entire file into memory (see the pydocs).

import fileinput
for line in fileinput.input(['myfile']):
    do_something(line)

Code example taken from yak.net

@mgilson's answer is correct. The simple solution bears official mention though (@HerrKaputt mentioned this in a comment):

file = open('datafile')
for line in file:
    process(line)
file.close()

This is simple, Pythonic, and understandable. If you don't understand how with works, just use this.

As the other poster mentioned, this does not create a large list the way file.readlines() does. Rather, it pulls off one line at a time, in the way that is traditional for Unix files/pipes.

If the file is JSON, XML, CSV, genomics or any other well-known format, there are specialized readers which use C code directly and are far more optimized for both speed and memory than parsing in native Python - avoid parsing it natively whenever possible.
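
For example, for CSV the standard library's csv module consumes the file object lazily, one row at a time; in this sketch 'datafile.csv' and handle_row are placeholders:

import csv

with open('datafile.csv', 'rb') as f:   # 'rb' is the usual mode for csv on Python 2
    for row in csv.reader(f):
        handle_row(row)                 # placeholder per-row handler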

But in general, tips from my experience:

  • Python's multiprocessing package is fantastic for managing subprocesses; all memory leaks go away when the subprocess ends.
  • Run the reader subprocess as a multiprocessing.Process and use a multiprocessing.Pipe(duplex=True) to communicate (send the filename and any other args, then read its output); see the sketch after this list.
  • Read in small (but not tiny) chunks, say 64KB-1MB. This is better for memory usage, and also for responsiveness with respect to other running processes/subprocesses.
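
A rough sketch of that arrangement (assumptions: 'datafile' and handle_chunk are placeholders, and a None sentinel marks end-of-file):

import multiprocessing

def reader(conn, path, chunk_size=64 * 1024):
    # Runs in a child process; whatever memory it allocates is released
    # back to the OS when the process exits.
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            conn.send(chunk)
    conn.send(None)   # sentinel: no more data
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe(duplex=True)
    p = multiprocessing.Process(target=reader, args=(child_conn, 'datafile'))
    p.start()
    while True:
        chunk = parent_conn.recv()
        if chunk is None:
            break
        # handle_chunk(chunk)   # placeholder: process the chunk here
    p.join()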
