
Reading text files into lists, then storing them in a dictionary fills system memory? (What am I doing wrong?) Python

I have 43 text files that consume "232.2 MB on disk (232,129,355 bytes) for 43 items". I want to read them into memory (see the code below). The problem I am having is that each file, which is about 5.3 MB on disk, causes Python to use an additional 100 MB of system memory. I checked the size of the dict with getsizeof() (see the sample output below): even when Python is using up to 3 GB of system memory, getsizeof() on the dict reports only 6424 bytes. I don't understand what is using the memory.

What is using up all the memory?

The related question linked is different in that the memory use reported by Python was "correct" there (related question). I am not very interested in other solutions such as a DB; I am more interested in understanding what is happening, so I know how to avoid it in the future. That said, using other Python built-ins such as array rather than lists is a great suggestion if it helps. I have heard suggestions of using guppy to find out what is using the memory.
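
For reference, a minimal sketch of the kind of guppy check I have in mind (assuming guppy is installed; it uses the ImportDataFrom function defined below):

from guppy import hpy

hp = hpy()
alldata = ImportDataFrom("/Users/vmd/Dropbox/dna/data/rawdata")
print hp.heap()   # breaks down live memory by object type (str, list, tuple, ...)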

Sample output:

Loading into memory: ME49_800.txt
ME49_800.txt has 228484 rows of data
ME49_800.txt has 0 rows of masked data
ME49_800.txt has 198 rows of outliers
ME49_800.txt has 0 modified rows of data
280bytes of memory used for ME49_800.txt
43 files of 43 using 12568 bytes of memory
120

Sample data:

CellHeader=X    Y   MEAN    STDV    NPIXELS
  0   0 120.0   28.3     25
  1   0 6924.0  1061.7   25
  2   0 105.0   17.4     25

Code:

import csv, os, glob
import sys


def read_data_file(filename):
    reader = csv.reader(open(filename, "U"),delimiter='\t')
    fname = os.path.split(filename)[1]
    data = []
    mask = []
    outliers = []
    modified = []

    maskcount = 0
    outliercount = 0
    modifiedcount = 0

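    # Section flags: the file contains [MASKS], [OUTLIERS] and [MODIFIED] headers;
    # each flag is set once its header has been seen, so the checks below route
    # every row into whichever section is currently open.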
    for row in reader:
        if '[MASKS]' in row:
            maskcount = 1
        if '[OUTLIERS]' in row:
            outliercount = 1
        if '[MODIFIED]' in row:
            modifiedcount = 1
        if row:
            if not any((maskcount, outliercount, modifiedcount)):
                data.append(row)
            elif not any((not maskcount, outliercount, modifiedcount)):
                mask.append(row) 
            elif not any((not maskcount, not outliercount, modifiedcount)):
                outliers.append(row)  
            elif not any((not maskcount, not outliercount, not modifiedcount)):
                modified.append(row)
            else: print '***something went wrong***'

    data = data[1:]
    mask = mask[3:]
    outliers = outliers[3:]
    modified = modified[3:]
    filedata = dict(zip((fname + '_data', fname + '_mask', fname + '_outliers', fname+'_modified'), (data, mask, outliers, modified)))
    return filedata


def ImportDataFrom(folder):

    alldata = dict()
    infolder = glob.glob( os.path.join(folder, '*.txt') )
    numfiles = len(infolder)
    print 'Importing files from: ', folder
    print 'Importing ' + str(numfiles) + ' files from: ', folder

    for infile in infolder:
        fname = os.path.split(infile)[1]
        print "Loading into memory: " + fname

        filedata = read_data_file(infile)
        alldata.update(filedata)

        print fname + ' has ' + str(len(filedata[fname + '_data'])) + ' rows of data'
        print fname + ' has ' + str(len(filedata[fname + '_mask'])) + ' rows of masked data'
        print fname + ' has ' + str(len(filedata[fname + '_outliers'])) + ' rows of outliers'
        print fname + ' has ' + str(len(filedata[fname +'_modified'])) + ' modified rows of data'
        print str(sys.getsizeof(filedata)) + 'bytes of memory used for ' + fname
        print str(len(alldata)/4) + ' files of ' + str(numfiles) + ' using ' + str(sys.getsizeof(alldata)) + ' bytes of memory'
        #print alldata.keys()
        print str(sys.getsizeof(ImportDataFrom))
        print ' ' 

    return alldata


ImportDataFrom("/Users/vmd/Dropbox/dna/data/rawdata")

The dictionary itself is very small; the bulk of the data is the entire content of the files stored in lists, with one tuple per line. The 20x size increase is bigger than I expected, but it seems to be real. Splitting a 27-byte line from your example input into a tuple gives me 309 bytes (counting recursively, on a 64-bit machine). Add some unknown memory-allocation overhead on top of that, and 20x is not impossible.
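
A rough way to reproduce that per-row figure (a sketch; the exact numbers depend on the Python build and platform):

import sys

# One ~27-byte input line, split on tabs into a tuple of five strings.
row = tuple('0\t0\t120.0\t28.3\t25'.split('\t'))
# Count the tuple itself plus each string it points to.
print sys.getsizeof(row) + sum(sys.getsizeof(item) for item in row)   # roughly 300 bytes on 64-bit CPython 2.x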

Alternatives: for a more compact representation, you want to convert the strings to integers/floats and pack them tightly (without all those pointers and separate objects). I'm talking about not just one row (although that's a start), but a whole list of rows together, so that each file is represented by just four 2D arrays of numbers. The array module is a start, but what you really need here are numpy arrays:

import numpy

# Using explicit field types for compactness and access by name
# (e.g. data[i]['mean'] == data[i][2]).
fields = [('x', int), ('y', int), ('mean', float),
          ('stdv', float), ('npixels', int)]
# The simplest way is to build lists as you do now, and convert them
# to numpy arrays when done. Structured arrays expect one tuple per row,
# so convert the csv.reader lists first.
data = numpy.array([tuple(row) for row in data], dtype=fields)
mask = numpy.array([tuple(row) for row in mask], dtype=fields)
...

This gives me 40 bytes spent per row (measured on the .data attribute; sys.getsizeof reports that the array has a constant overhead of 80 bytes, but doesn't see the actual data it holds). This is still about 1.5x the size of the original files, but should easily fit into RAM.
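
A quick way to check that per-row cost is to look at the dtype itself (reusing fields and data from the snippet above):

import numpy

fields = [('x', int), ('y', int), ('mean', float),
          ('stdv', float), ('npixels', int)]
print numpy.dtype(fields).itemsize   # 40: five 8-byte fields per row on a typical 64-bit build
print data.nbytes                    # itemsize * number of rows for the array built above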

I see two of your fields are labeled "x" and "y". If your data is dense, you could arrange it by them - data[x,y] == ... - instead of just storing (x, y, ...) records. Besides being slightly more compact, that would be the most sensible structure and would allow easier processing.
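
For example, a sketch of that dense arrangement (assuming every (x, y) cell occurs exactly once; the MEAN column is used here, but any field would do):

import numpy

# Build a 2D grid of MEAN values indexed by (x, y) instead of keeping per-row records;
# 'data' is the structured array from above.
grid = numpy.zeros((data['x'].max() + 1, data['y'].max() + 1))
grid[data['x'], data['y']] = data['mean']
# grid[x, y] now holds the mean for that cell, and the x/y columns no longer need storing.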

If you need to handle even more data than your RAM will fit, pytables is a good library for efficient access to compact (even compressed) tabular data in files. (It's much better at this than general SQL DBs.)
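
If you go that route, a minimal sketch of writing one file's table with PyTables (hypothetical file and table names; the compression settings are just an illustration):

import numpy
import tables

h5 = tables.open_file('rawdata.h5', mode='w')
# Reuse the structured dtype from above as the table description.
table = h5.create_table('/', 'me49_800_data', description=numpy.dtype(fields),
                        filters=tables.Filters(complevel=5, complib='zlib'))
table.append(data)   # 'data' is the structured numpy array built earlier
h5.close()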

This line specifically gets the size of the function object:

print str(sys.getsizeof(ImportDataFrom))

that's unlikely to be what you're interested in.

The size of a container does not include the size of the data it contains. Consider, for example:

>>> import sys
>>> d={}
>>> sys.getsizeof(d)
140
>>> d['foo'] = 'x'*99
>>> sys.getsizeof(d)
140
>>> d['foo'] = 'x'*9999
>>> sys.getsizeof(d)
140

If you want the size of the container plus the size of everything it contains, you have to write your own (presumably recursive) function that reaches inside containers and accounts for every byte. Or, you can use a third-party library such as Pympler or guppy.
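
For instance, Pympler's asizeof does that recursive accounting for you (a small sketch, assuming Pympler is installed):

import sys
from pympler import asizeof

d = {'foo': 'x' * 9999}
print sys.getsizeof(d)     # size of the dict structure only
print asizeof.asizeof(d)   # the dict plus the key and the 9999-character string it refers to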
