
Looking for an efficient way to split a large text file by number of lines in Python

I am currently trying to split a large file (>200 GB). The goal is to divide it into smaller chunks. I have written the following code and it works great on smaller files, but on the large file my computer restarts. At this point I can't figure out whether it is a hardware issue (i.e. processing power) or something else. I'm also looking for ideas on a more efficient way of doing the same thing.

    import os

    def split(source, target, lines):
        index = 0
        block = 0
        if not os.path.exists(target):
            os.mkdir(target)
        with open(source, 'rb') as s:
            chunk = s.readlines()
            while block < len(chunk):
                # write the next `lines` lines to a numbered output file
                with open(target + f'file_{index:04d}.txt', 'wb') as t:
                    t.writelines(chunk[block: block + lines])
                index += 1
                block += lines

It's the s.readlines() call that kills it, since it tries to load the entire file into memory at once.

You could do something like

with open("largeFile",'rb') as file:
    while True:
        data = file.read(1024) //blocksize

file.read() reads at most the specified block size, which should avoid the memory issue you're currently having.
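As a minimal sketch of how that block-read loop could be turned into an actual splitter (split_by_bytes, the 64 MB default chunk_size, and the part_*.bin naming scheme are illustrative assumptions, not from the original answer):

    import os

    def split_by_bytes(source, target, chunk_size=64 * 1024 * 1024):
        # Hypothetical helper: split source into pieces of at most chunk_size bytes.
        os.makedirs(target, exist_ok=True)
        index = 0
        with open(source, 'rb') as s:
            while True:
                data = s.read(chunk_size)   # never holds more than one chunk in memory
                if not data:                # empty bytes object means end of file
                    break
                with open(os.path.join(target, f'part_{index:04d}.bin'), 'wb') as t:
                    t.write(data)
                index += 1

Note that a plain byte split can cut a line in two, which is why the edit below switches to a line-oriented approach for text files.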

EDIT:

I'm not smart, I've missed the "text file" part in your title, sorry.

In that case it should be enough to use file.readline() instead of file.readlines()
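A minimal sketch of the original split function rewritten that way, assuming it is fine to iterate over the file object (which is equivalent to repeated readline() calls) and using itertools.islice to take a fixed number of lines at a time:

    import os
    from itertools import islice

    def split(source, target, lines):
        os.makedirs(target, exist_ok=True)
        index = 0
        with open(source, 'rb') as s:
            while True:
                chunk = list(islice(s, lines))   # at most `lines` lines, never the whole file
                if not chunk:                    # nothing left to read
                    break
                with open(os.path.join(target, f'file_{index:04d}.txt'), 'wb') as t:
                    t.writelines(chunk)
                index += 1

This keeps at most the requested number of lines in memory at any time, so it should behave the same on a 200 GB file as it does on a small one.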
