
python pandas read_csv taking forever

I'm trying to load a 128MB file using pandas (after googling, I found that it's faster than open or np.loadtxt). The file has 1000 lines, each containing 65K values that are either 0 or 1, separated by a single space.

For some reason it's taking ages and I can't figure out why. 128MB sounds fairly small to me, and Matlab loads it in about a minute.

Here is my (simple) code:

import os
import numpy as np
import pandas as pd
import time

DATA_DIR=r'D:\BinaryDescriptors3\ORBLearningIntermediatResults2'
TEST_DIR='yosemite_harris'
OUT_DIR=r'D:\BinaryDescriptors3\ORBLearningTripletsFinalResults'
PATCH_NUM=1000

data_filename=TEST_DIR+'_' + str(PATCH_NUM) + '_ORBresfile.txt'

data_filepath = os.path.join(DATA_DIR,data_filename)

s=time.time()
print "START"
data =  pd.read_csv(data_filepath,delimiter=' ')

e=time.time()

print e-s

It never reached the last line (I gave it 30 minutes before terminating it). Why is reading a small, 128MB file taking so long?
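(For anyone who wants to reproduce the timings: a file in the same format can be generated synthetically. The sketch below is only illustrative; the output filename is made up, and the real file has 1000 lines rather than 10.)

import numpy as np

# Sketch: write N_LINES lines, each with N_VALUES space-separated 0/1 values.
# 'synthetic_ORBresfile.txt' is a made-up name used for illustration only.
N_LINES, N_VALUES = 10, 65000
with open('synthetic_ORBresfile.txt', 'w') as f:
    for _ in range(N_LINES):
        row = np.random.randint(0, 2, N_VALUES)
        f.write(' '.join(map(str, row)) + '\n')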

EDIT:

  1. When trying to read only one line using the following command:

    data = pd.read_csv(data_filepath,delimiter=' ', nrows=1)

I get the following error:

Traceback (most recent call last):
  File "C:\eclipse\plugins\org.python.pydev_3.7.1.201409021729\pysrc\pydevd.py", line 2090, in <module>
    debugger.run(setup['file'], None, None)
  File "C:\eclipse\plugins\org.python.pydev_3.7.1.201409021729\pysrc\pydevd.py", line 1547, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "D:\BinaryDescriptors3\Python\LearnTripletsOrb\LearnTripletsOrb.py", line 18, in <module>
    data =  pd.read_csv(data_filepath,delimiter=' ', nrows=1)
  File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 443, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 231, in _read
    return parser.read(nrows)
  File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 686, in read
    ret = self._engine.read(nrows)
  File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 1130, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 727, in pandas.parser.TextReader.read (pandas\parser.c:7146)
  File "parser.pyx", line 774, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7707)
StopIteration
  2. When trying to read a similar file that contains only one line of 65K characters, I also get the following error:

Traceback (most recent call last):
  File "C:\eclipse\plugins\org.python.pydev_3.7.1.201409021729\pysrc\pydevd.py", line 2090, in <module>
    debugger.run(setup['file'], None, None)
  File "C:\eclipse\plugins\org.python.pydev_3.7.1.201409021729\pysrc\pydevd.py", line 1547, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "D:\BinaryDescriptors3\Python\LearnTripletsOrb\LearnTripletsOrb.py", line 20, in <module>
    data = pd.read_csv(data_filepath,delimiter=' ', nrows=1)
  File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 443, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 231, in _read
    return parser.read(nrows)
  File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 686, in read
    ret = self._engine.read(nrows)
  File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 1130, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 727, in pandas.parser.TextReader.read (pandas\parser.c:7146)
  File "parser.pyx", line 774, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7707)
StopIteration

  3. I also tried producing a similar file that contains 2 lines of 65K values but uses "," as the delimiter, and got the same error as in 1 and 2.

  4. If read_csv is not the correct approach, can you please recommend a suitable alternative?

The question is old but I hope others might find the answer useful.
Pandas (less so NumPy) is optimized for, and very good at, working with data that has plenty of rows and a limited number of columns (say, a few dozen at most). Your case seems to be the opposite, so it is not the right tool for the task.
I would preprocess the data before loading it into a DataFrame, and I would swap columns and rows in the DataFrame for further processing. It goes something like this:

import pandas as pd

txt = open(data_filepath).readlines()        # read the raw lines first
df = pd.DataFrame(columns=range(len(txt)))   # one column per input line
for i, ln in enumerate(txt):
  row_items = ln.split()
  df[i] = row_items                          # the 65K values of line i become column i
...
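The values come in as strings, so a likely follow-up (my addition, not part of the original snippet) is to cast the frame to a compact numeric dtype before further processing; a minimal sketch, assuming the 0/1 data described in the question:

# Follow-up sketch: cast the string values to a small integer dtype.
# df keeps the swapped layout: shape (65000, 1000), one column per input line.
df = df.astype('uint8')
print(df.shape)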

I believe this is going to be quite fast...
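Alternatively (again a sketch of my own, not part of the original answer): since every value is 0 or 1, the DataFrame can be skipped entirely and each line parsed straight into a NumPy row, which avoids creating 65K pandas columns in the first place. data_filepath is the variable from the question's code.

import numpy as np

# Sketch: parse each text line into a uint8 row and stack the rows into a
# (1000, 65000) matrix; assumes data_filepath from the question.
with open(data_filepath) as f:
    data = np.vstack([np.array(ln.split(), dtype=np.uint8) for ln in f])
print(data.shape)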
