
Huge memory cost in Python

I am writing a piece of code that uses objects in Python. I have 1.1GB of files that need to be parsed and transformed into objects.

However, with 1.1GB of files, it consumes more than 7GB of memory (I stopped it there, because it could have gone further), which is quite a lot. I used memory_profiler to inspect what is going on, and here is an example of the result I got:

Line #    Mem usage    Increment   Line Contents
================================================
78   30.352 MiB    0.000 MiB   @profile
79                             def getInfos(listExch):
80
81   30.352 MiB    0.000 MiB    SizeTot = 0
82   30.352 MiB    0.000 MiB    upListExch = set()
83
84 5325.996 MiB 5295.645 MiB    for exch in listExch:
85
86
87 5325.996 MiB    0.000 MiB        symbExch = exch.symb
88 5325.996 MiB    0.000 MiB        nameExch = exch.name
89 5325.996 MiB    0.000 MiB        stList = exch.getStList()
90 5325.996 MiB    0.000 MiB        upExch = Exch(symbExch,nameExch)
91
92 7572.309 MiB 2246.312 MiB        for st in stList:
93
94 7572.309 MiB    0.000 MiB            unexpected = False
95 7572.309 MiB    0.000 MiB            symbSt = st.symb
96
97 7572.309 MiB    0.000 MiB            filepath = '{0}/{1}.csv'.format(download_path,symbSt)
98
99 7572.309 MiB    0.000 MiB            upSt = parseQ(st,filepath)
100 7572.309 MiB    0.000 MiB               upExch.addSt(upSt)
101 5325.996 MiB -2246.312 MiB      upListExch.add(upExch)
102
103                                 return upListExch

Here are the object models I wrote:

An Exch object contains a listSt of St objects, and each St contains a listQ of Q objects.

class Exch:
    def __init__(self,symb,name):
        self.symb = symb
        self.name = name
        self.listSt = set()

    def addSt(self,st):
        self.listSt.add(st)

    def setStList(self,listSt):
        self.listSt = listSt

    def getStList(self):
        return self.listSt

class St:
    def __init__(self,symb,name):
        self.symb = symb
        self.name = name
        self.listQ = set()

    def getQList(self):
        return self.listQ

    def addQ(self,q):
        self.listQ.add(q)

class Q:
    def __init__(self,date,dataH,dataM,dataL):
        self.date = date
        self.dataH = dataH
        self.dataM = dataM
        self.dataL = dataL

Did I do something wrong here? Or is Python simply not suited to this amount of data?

EDIT:

The input listExch is a list of Exch objects, and each St in listSt starts with an empty listQ.

The output is the same as the input, except that the listQ of each St object has been populated.

Here is the parser:

def parseQ(st,filepath):

    loc_date,loc_dataH,loc_dataM,loc_dataL = 0,0,0,0

    with open(filepath, 'rt') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        row1 = next(reader)
        unexpected = False

        # locate each expected column in the header row
        for idx, el in enumerate(row1):
            if el == 'Date':
                loc_date = idx
            elif el == 'Number High':
                loc_dataH = idx
            elif el == 'Number Medium':
                loc_dataM = idx
            elif el == 'Number Low':
                loc_dataL = idx
            else:
                log.error('Unexpected format on file {}. Skip the file'.format(filepath))
                unexpected = True
                break

        if unexpected:
            log.error('The file "{}" is not properly set'.format(filepath))
            return False

        next(reader)
        for row in reader:
            try:
                st.addQ(Q(row[loc_date], row[loc_dataH], row[loc_dataM], row[loc_dataL]))
            except IndexError:
                # assumed handling: skip rows that are missing an expected column
                continue

    return st

I'm not surprised in the least.

Reading in a CSV generates a list of rows, each of which points to a list of elements.

Now, each of those is a PyObject, meaning that it has a type reference, which usually takes a size_t, I think, and the list containing it has to store its id (which, coincidentally, is just a pointer to the PyObject). So that's two size_t, i.e. twice the size of your pointer type, just for the fact that an element exists. And that's not even considering that the element's "payload" needs some memory, too!

On a 64-bit machine, that's 128 bits of pure structural overhead per element. I don't know what your elements look like, but it's quite possible that this is more than the actual content.
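You can get a feel for this overhead with sys.getsizeof; the numbers below are typical for 64-bit CPython and vary a bit between versions:

import sys

# A Python float is a full PyObject: roughly 24 bytes on 64-bit CPython,
# versus 8 bytes for the raw double it wraps.
print(sys.getsizeof(1.0))          # typically 24

# A list stores one 8-byte pointer per element, on top of the element
# objects themselves (getsizeof does not follow those pointers).
row = [1.0, 2.0, 3.0, 4.0]
print(sys.getsizeof(row))          # typically 88: list header plus 4 pointers

# Rough total for the row: the list plus the four float objects it points to,
# i.e. on the order of 180 bytes for 32 bytes of actual numeric data.
print(sys.getsizeof(row) + sum(sys.getsizeof(x) for x in row))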

Generally, don't do this. If your data is tabular, load it with numpy, which will not build Python lists of lists, but just allocate one big block of memory for the raw values and compute each element's address on access, rather than going the Python route of hopping from pointer to pointer. That way, you'll also gain a lot of speed.
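As a rough sketch of what that looks like (the file path is hypothetical, and the dtype below just mirrors the four columns the parser above expects):

import numpy as np

# Minimal sketch: load the whole CSV into one contiguous structured array
# instead of one Python object per cell. File name and field names are
# illustrative; adjust the dtype to the real columns.
data = np.genfromtxt(
    'some_symbol.csv',
    delimiter=',',
    skip_header=1,    # skip the 'Date,Number High,...' header row
    dtype=[('date', 'U10'), ('high', 'f8'), ('med', 'f8'), ('low', 'f8')],
)

print(data['high'].mean())
print(data.nbytes)    # raw payload size, with no per-element object overhead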

Let me also mention that CSV is an especially poor format for storing large amounts of data. There's no formal definition of the language (that's why Python's CSV reader has the concept of "dialects"), it's horribly inefficient (and potentially imprecise) at storing floating-point numbers, there's no way to access the Nth row without reading all N-1 previous rows, it depends on string parsing, and it can't be used to modify values in place unless the length of the string stays the same... All in all: you're doing well if you read these files in once and convert them into a format that is actually tabular on disk.

The "but CSV is plaintext and I can read it with my text editor" argument doesn't really count – no-one is able to "quickly click through" 1GB of CSV. So, try to get rid of the CSV files. Numpy, again has a native storage format which probably is OK for your purposes, or use HDF or any other standard format – or if all elements of your CSV are of the same type, you could also just save them as raw byte image of your data – that would be the fastest and most space-efficient method of storage (yet you'll have to "externally" remember the structure of the data), sparsity aside.

EDIT: As the OP points out, exactly that is his plan: read in the CSV, validate its content, and then store it in a database! Good style.

Now, reading can happen row-wise: you read a row (or a couple of rows), store the data in the database, forget those rows, get the next ones, and so on. Validation can then happen on the data stored in the database, perhaps in a separate "staging" part of it.
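A minimal sketch of that flow, using sqlite3 from the standard library as a stand-in for whatever database is actually used (the table name, batch size and file path are made up here):

import csv
import sqlite3

# Read a chunk of rows, push it to a staging table, forget it, repeat.
conn = sqlite3.connect('quotes.db')
conn.execute('CREATE TABLE IF NOT EXISTS staging_quotes (date TEXT, high REAL, med REAL, low REAL)')

with open('some_symbol.csv', 'rt', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)                    # skip the header row
    batch = []
    for row in reader:
        batch.append((row[0], row[1], row[2], row[3]))
        if len(batch) >= 10000:     # keep at most one batch in memory
            conn.executemany('INSERT INTO staging_quotes VALUES (?, ?, ?, ?)', batch)
            conn.commit()
            batch.clear()
    if batch:                       # flush the final partial batch
        conn.executemany('INSERT INTO staging_quotes VALUES (?, ?, ?, ?)', batch)
        conn.commit()

conn.close()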
