
python: Improving the way I am reading a large (5GB) txt file

I am currently using pandas.read_csv to read a large (5 GB, ~97 million rows × 7 columns) txt file (a point cloud) with Python.

My need is to read the first three columns (which represent x, y, z coordinates), and retrieve the bounding-box of my point cloud (in the form [x_min, y_min, z_min, x_max, y_max, z_max]).

As it stands, my code (see below) is taking hours to finish (actually it started yesterday and it has not finished yet...). The machine I am working on has an Intel Xeon E5-1630 v3 CPU @ 3.70 GHz. I am using 64-bit Python 3.6.

A few key points of my code...

The documentation for that function says that using the usecols parameter *results in much faster parsing time and lower memory usage*, so I included only the columns I am interested in.

I am not entirely sure about the real usefulness of the chunksize argument (maybe I am using it the wrong way...). As I have used it, it reads the file line by line, and that is probably not the best approach.

Here is the code; any suggestion (also regarding approaches other than pandas.read_csv) would be much appreciated.

import pandas as pd
from datetime import datetime


def bounding_box(filename):
    startTime = datetime.now()  # start the timer

    for row in pd.read_csv(filename, sep=r'\s+', header=None, chunksize=1, skiprows=1, usecols=[0, 1, 2]):
        if 'x_min' not in locals():
            x_min = row.iat[0, 0]
        if 'y_min' not in locals():
            y_min = row.iat[0, 1]
        if 'z_min' not in locals():
            z_min = row.iat[0, 2]

        if 'x_max' not in locals():
            x_max = row.iat[0, 0]
        if 'y_max' not in locals():
            y_max = row.iat[0, 1]
        if 'z_max' not in locals():
            z_max = row.iat[0, 2]

        x_min = row.iat[0, 0] if row.iat[0, 0] < x_min else x_min
        y_min = row.iat[0, 1] if row.iat[0, 1] < y_min else y_min
        z_min = row.iat[0, 2] if row.iat[0, 2] < z_min else z_min

        x_max = row.iat[0, 0] if row.iat[0, 0] > x_max else x_max
        y_max = row.iat[0, 1] if row.iat[0, 1] > y_max else y_max
        z_max = row.iat[0, 2] if row.iat[0, 2] > z_max else z_max

    bbox = [x_min, y_min, z_min, x_max, y_max, z_max]
    print("TIME OF PROCESSING: {}".format(datetime.now() - startTime))  # print elapsed time

    return bbox

Since I don't have a 5 GB file ready for testing, I can only guess that these two issues are slowing you down:

  1. reading the file line by line (and converting each line to a dataframe)
  2. complicated logic including locals() and element access for each line

To address these points, increase the chunksize argument to something large that still fits into memory without paging. I imagine that chunk sizes in the thousands or even more would work well.

Then simplify (vectorize) the logic: you can easily calculate the bounding box of each chunk and then widen the "big" bounding box whenever a chunk falls outside it. Something like this:

import numpy as np
import pandas as pd

filename = 'test.csv'

bbox_min = np.full(3, np.inf)
bbox_max = np.full(3, -np.inf)
for chunk in pd.read_csv(filename, sep=r'\s+', header=None, chunksize=10000,
                         skiprows=1, usecols=[0, 1, 2]):
    values = chunk.to_numpy()
    bbox_min = np.minimum(bbox_min, values.min(axis=0))
    bbox_max = np.maximum(bbox_max, values.max(axis=0))

bbox = np.ravel([bbox_min, bbox_max])
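To sanity-check the chunked approach on a small sample, one can write a few points to a temporary whitespace-separated file (with a header line, since skiprows=1 drops the first row) and confirm the bounding box comes out as expected. The sample data below is made up for illustration; it just mimics the layout the read_csv call assumes:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# a tiny whitespace-separated "point cloud" with a header line
sample = """x y z extra
1.0 2.0 3.0 0
-4.0 5.0 0.5 0
2.5 -1.0 9.0 0
"""

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write(sample)
    path = f.name

bbox_min = np.full(3, np.inf)
bbox_max = np.full(3, -np.inf)
# chunksize=2 forces the loop to run over more than one chunk
for chunk in pd.read_csv(path, sep=r'\s+', header=None, chunksize=2,
                         skiprows=1, usecols=[0, 1, 2]):
    values = chunk.to_numpy()
    bbox_min = np.minimum(bbox_min, values.min(axis=0))
    bbox_max = np.maximum(bbox_max, values.max(axis=0))

os.remove(path)
bbox = np.concatenate([bbox_min, bbox_max])
# bbox is [-4.0, -1.0, 0.5, 2.5, 5.0, 9.0]
```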

Please correct me if I've misunderstood the question: you need to calculate a "bounding box", that is, the minimal box containing all your points?

What about simply taking min() and max() for each coordinate, like this?

# a very simple DataFrame for the demo
>>> df = pd.DataFrame({0: [1, 2, 3], 1: [3, 4, 5], 2: [3, 4, 1]})

>>> df
   0  1  2
0  1  3  3
1  2  4  4
2  3  5  1

>>> df[0].min(), df[0].max()   # Xmin, Xmax
(1, 3)

>>> df[1].min(), df[1].max()   # Ymin, Ymax
(3, 5)

>>> df[2].min(), df[2].max()   # Zmin, Zmax
(1, 4)
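Since DataFrame.min() and DataFrame.max() already operate column-wise by default, the three pairs above can also be collected in one call each. A small sketch continuing the same demo DataFrame:

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2, 3], 1: [3, 4, 5], 2: [3, 4, 1]})

# column-wise minima and maxima, one call each
mins = df.min()   # Series: 1, 3, 1
maxs = df.max()   # Series: 3, 5, 4

bbox = mins.tolist() + maxs.tolist()
# bbox is [1, 3, 1, 3, 5, 4], i.e. [Xmin, Ymin, Zmin, Xmax, Ymax, Zmax]
```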

However, if this is the only task, pandas would be overkill. A much faster and leaner solution is to read the file line by line and update the bounds as you go:

import csv, math

xmin = ymin = zmin = +math.inf
xmax = ymax = zmax = -math.inf

with open('data/1.csv', 'r', newline='') as f:
    for row in csv.reader(f, delimiter=','):   # adjust delimiter/columns to your file
        x, y, z = float(row[0]), float(row[1]), float(row[2])
        xmin, xmax = min(xmin, x), max(xmax, x)
        ymin, ymax = min(ymin, y), max(ymax, y)
        zmin, zmax = min(zmin, z), max(zmax, z)

print(xmin, ymin, zmin, xmax, ymax, zmax)

This approach has a serious advantage: it reads the file line by line, and once a line has been processed it is thrown away. So it can work with files of virtually any length, even terabytes!
