
Avoiding loops in python when using classes

I have a Python class that I'm filling inside a for loop, and it is very slow when looping through millions of data lines; there obviously must be a faster way. Perhaps I shouldn't be using a class at all, but I need some kind of structure so that I can sort the data.

Here is that class:

class Particle(object):
    def __init__(self, ID, nH, T, metallicity, oxygen, o6, o7, o8):
        self.ID = ID
        self.nH = nH
        self.T = T
        self.metallicity = metallicity
        self.oxygen = oxygen
        self.o6 = o6
        self.o7 = o7
        self.o8 = o8

and here is how I first filled it up, after reading in all the individual arrays (ID, nH, T, etc.), using append, which is of course exceedingly slow:

partlist = []

for i in range(npart):
    partlist.append(Particle(int(ID[i]), nH[i], T[i], metallicity[i], oxygen[i], o6[i], o7[i], o8[i]))

This takes a couple of hours for 30 million values, and 'append' is obviously not the right way to do it. I thought this was an improvement:

partlist = [Particle(int(ID[i]), nH[i], T[i], metallicity[i], oxygen[i], o6[i], o7[i], o8[i]) for i in range(npart)]

but it seems to be taking just as long and hasn't finished after an hour.

I'm new to Python, and I know that looping over indexes is not "pythonic", but I'm at a loss as to how to create and fill a Python object like this in what should only take a few minutes.

Suggestions? Thanks in advance.

Use the correct tool for the job:

You need to research more efficient data structures to begin with. Regular objects are not going to be the best solution for what you are trying to do if you need the entire dataset in memory at once.

Use xrange() instead of range().

In Python 2, range(30000000) creates a list of 30,000,000 numbers in memory; xrange() doesn't, it evaluates lazily, the way a generator would.
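As a minimal illustration (Python 2 semantics; in Python 3, range() is already lazy):

# range() materializes the full list of indexes before the loop starts
for i in range(30000000):
    pass

# xrange() produces each index on demand, using constant memory
for i in xrange(30000000):
    pass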


Use numpy to store and process the data in arrays efficiently.

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
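As a rough sketch of the idea (toy data, not the asker's arrays): NumPy lets you define a compound dtype, hold one record per row, and sort on any field without a Python-level loop.

import numpy as np

# A compound dtype: each array element is one (ID, nH, T) record
dtype = [('ID', int), ('nH', float), ('T', float)]
particles = np.array([(1, 0.5, 1e4), (2, 0.1, 1e6), (3, 0.9, 1e5)], dtype=dtype)

# Sort the whole table by temperature in one call
by_temperature = np.sort(particles, order='T')
print(by_temperature['ID'])   # [1 3 2]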

Processing:

Research stream processing and Map/Reduce approaches to processing the data. If you can avoid loading the entire data set into memory and instead process it as it is read, you can skip all of the object creation and list building completely.
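A minimal sketch of the streaming idea, assuming (hypothetically) that the data arrive as whitespace-separated lines in a text file; the file name and column layout below are made up for illustration:

def read_particles(path):
    # Generator: yields one parsed record at a time instead of building a giant list
    with open(path) as f:
        for line in f:
            fields = line.split()
            yield int(fields[0]), float(fields[1]), float(fields[2])

# Running statistics can be accumulated without ever holding all rows in memory
count = 0
total_T = 0.0
for ID, nH, T in read_particles('particles.txt'):   # hypothetical file name
    count += 1
    total_T += T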

Beyond that, 30,000,000 of anything is still 30,000,000 of something: if you do not have enough RAM to hold it all in memory, it is just going to swap to disk and grind away. But there is not enough information here to know whether you need the entire thing in one giant list to begin with.

Thanks for the answers. Jarrod's point about using multi-dimensional arrays in numpy was the most helpful. Here's what I have now, which works 40x faster:

parttype = [('ID', int), ('nH', float), ('T', float), ('metallicity', float),
            ('oxygen', float), ('o6', float), ('o7', float), ('o8', float)]

partlist = np.zeros((npart,), dtype=parttype)

for i in xrange(npart):
    partlist[i] = (int(ID[i]), nH[i], T[i], metallicity[i], oxygen[i], o6[i], o7[i], o8[i])

Still a for loop, but it runs reasonably fast for my data (6 minutes vs. 4 hours)!
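For what it's worth, a further (hedged) step, assuming ID, nH, T, etc. are already plain NumPy arrays of length npart: the remaining loop can be replaced by whole-column assignments, and the structured array can then be sorted on any field.

partlist['ID'] = ID                 # NumPy casts the column to the field's int dtype
partlist['nH'] = nH
partlist['T'] = T
partlist['metallicity'] = metallicity
partlist['oxygen'] = oxygen
partlist['o6'] = o6
partlist['o7'] = o7
partlist['o8'] = o8

# e.g. sort the whole table by temperature
partlist_sorted = np.sort(partlist, order='T')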
