
Looking for ideas for efficient way to split large text file by number of lines in python

I am currently trying to split a large file (>200GB). The goal is to divide the large file into smaller chunks. I have written the following code and it works great on smaller files. However, on the larger file my computer restarts. At this point I can't figure out whether it is a hardware issue (i.e. processing power) or some other reason. I'm also looking for ideas if there is a more efficient way of doing the same thing.

  import os

  def split(source, target, lines):
      index = 0
      block = 0
      if not os.path.exists(target):
          os.mkdir(target)
      with open(source, 'rb') as s:
          chunk = s.readlines()          # reads the entire file into memory at once
          while block < len(chunk):
              with open(os.path.join(target, f'file_{index:04d}.txt'), 'wb') as t:
                  t.writelines(chunk[block: block + lines])
              index += 1
              block += lines

It's the s.readlines() that kills it, since it tries to load the entire file into memory.

You could do something like

with open("largeFile",'rb') as file:
    while True:
        data = file.read(1024) //blocksize

file.read() only reads the specified block size at a time, which should avoid the memory issue you're currently having.
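As a rough sketch (not part of the original answer), that read loop could be extended into a fixed-size chunk splitter; the helper name split_binary, the default chunk size, and the output file naming are assumptions. Note that binary chunks may end mid-line, which the later edit addresses.

import os

def split_binary(source, target, chunk_size=64 * 1024 * 1024):
    # Split `source` into fixed-size binary chunks written into directory `target`.
    os.makedirs(target, exist_ok=True)
    index = 0
    with open(source, 'rb') as s:
        while True:
            data = s.read(chunk_size)   # read at most chunk_size bytes
            if not data:                # empty bytes object => end of file
                break
            with open(os.path.join(target, f'file_{index:04d}.bin'), 'wb') as t:
                t.write(data)
            index += 1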

EDIT:

I'm not smart, I missed the "text file" part in your title, sorry.

In that case it should be enough to use file.readline() instead of file.readlines().
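As a hedged sketch of that suggestion, here is one way the original split() could be rewritten to read a block of lines at a time instead of calling readlines(); the use of itertools.islice and the helper name split_by_lines are my own choices, not from the answer.

import os
from itertools import islice

def split_by_lines(source, target, lines):
    # Split `source` into files of at most `lines` lines each,
    # reading only one block of lines into memory at a time.
    os.makedirs(target, exist_ok=True)
    index = 0
    with open(source, 'rb') as s:
        while True:
            block = list(islice(s, lines))  # next `lines` lines, or fewer at EOF
            if not block:                   # empty list => end of file
                break
            with open(os.path.join(target, f'file_{index:04d}.txt'), 'wb') as t:
                t.writelines(block)
            index += 1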
