
Python MemoryError loading text files

I'm trying to load ~2GB of text files (approx 35K files) in my Python script. I'm getting a memory error around a third of the way through, on page.read().

cFile_list = []
for f in files:
    page = open(f)
    pageContent = page.read().replace('\n', '')
    page.close()

    cFile_list.append(pageContent)  # the full text of every file accumulates here

I've never dealt with objects or processes of this size in Python. I checked some of the other Python MemoryError related threads, but I couldn't find anything that fixes my scenario. Hopefully there is something out there that can help me out.

You are trying to load too much into memory at once. This can be because of the process size limit (especially on a 32-bit OS), or because you don't have enough RAM.

A 64-bit OS (and 64-bit Python) would be able to do this OK given enough RAM, but maybe you can simply change the way your program works so that not every page is in RAM at once.
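For example, if each page can be processed independently, you could handle one file at a time instead of collecting them all. A minimal sketch, where process_page is a hypothetical stand-in for whatever work you do with each page:

for f in files:
    with open(f) as page:
        page_content = page.read().replace('\n', '')
    process_page(page_content)  # only one page is held in memory at a time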

What is cFile_list used for? Do you really need all the pages in memory at the same time?

Consider using generators, if possible in your case:

file_list = []
for file_ in files:
    # store a lazy generator per file instead of the file's full contents
    file_list.append(line.replace('\n', '') for line in open(file_))

file_list is now a list of iterators, which is more memory-efficient than reading the whole contents of each file into a string. As soon as you need the whole string of a particular file, you can do

string_ = ''.join(file_list[i])

Note, however, that each of these iterators can only be traversed once, due to the nature of iterators in Python.
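A quick illustration of that one-shot behaviour (example.txt is a placeholder file name):

gen = (line.replace('\n', '') for line in open('example.txt'))
first = ''.join(gen)   # consumes the generator
second = ''.join(gen)  # '' -- the generator is already exhausted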

See http://www.python.org/dev/peps/pep-0289/ for more details on generators.

Reading the whole file into memory this way is not effective.

The right way is to get used to indexes.

First you need to build a dictionary with the start position of each line (the key is the line number, and the value is the cumulative length of the previous lines):

t = open(file, 'r')   # 'file' is the path of the text file to index
dict_pos = {}         # line number -> offset of that line's start

kolvo = 0             # current line number
length = 0            # cumulative length of all previous lines
for each in t:
    dict_pos[kolvo] = length
    length = length + len(each)
    kolvo = kolvo + 1

and finally, the lookup function:

def give_line(line_number):
    t.seek(dict_pos.get(line_number))  # jump straight to the start of that line
    line = t.readline()
    return line

t.seek(offset) moves the file pointer straight to the start of the target line, so the readline() that follows returns exactly the line you want. Using this approach (jumping directly to the necessary position in the file instead of scanning through the whole file), you save a significant amount of time and can handle huge files.
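As a hedged usage sketch, once the index has been built as above (and assuming an ASCII file, since the index counts characters while seek works in bytes):

print(give_line(0))    # first line of the file
print(give_line(500))  # jumps straight to line 501 without scanning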
