I'm trying to load a 128MB file using pandas (after some googling I found that it's faster than open or np.loadtxt). The file has 1000 lines, each containing 65K values that are either 0 or 1, separated by single spaces.
For some reason it's taking ages and I can't figure out why. 128MB sounds fairly small to me, and Matlab loads it in about a minute.
Here is my (simple) code:
import os
import numpy as np
import pandas as pd
import time
DATA_DIR=r'D:\BinaryDescriptors3\ORBLearningIntermediatResults2'  # raw string so backslashes aren't treated as escapes
TEST_DIR='yosemite_harris'
OUT_DIR=r'D:\BinaryDescriptors3\ORBLearningTripletsFinalResults'
PATCH_NUM=1000
data_filename=TEST_DIR+'_' + str(PATCH_NUM) + '_ORBresfile.txt'
data_filepath = os.path.join(DATA_DIR,data_filename)
s=time.time()
print "START"
data = pd.read_csv(data_filepath,delimiter=' ')
e=time.time()
print e-s
It never reached the last line (I gave it 30 minutes before terminating it). Why is reading a small, 128MB file taking so long?
EDIT:
When trying to read only one line using the following command:
data = pd.read_csv(data_filepath,delimiter=' ', nrows=1)
I get the following error:
Traceback (most recent call last):
File "C:\eclipse\plugins\org.python.pydev_3.7.1.201409021729\pysrc\pydevd.py", line 2090, in <module>
debugger.run(setup['file'], None, None)
File "C:\eclipse\plugins\org.python.pydev_3.7.1.201409021729\pysrc\pydevd.py", line 1547, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "D:\BinaryDescriptors3\Python\LearnTripletsOrb\LearnTripletsOrb.py", line 18, in <module>
data = pd.read_csv(data_filepath,delimiter=' ', nrows=1)
File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 443, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 231, in _read
return parser.read(nrows)
File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 686, in read
ret = self._engine.read(nrows)
File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 1130, in read
data = self._reader.read(nrows)
File "parser.pyx", line 727, in pandas.parser.TextReader.read (pandas\parser.c:7146)
File "parser.pyx", line 774, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7707)
StopIteration
When trying to read a similar file that contains only one line of 65K characters, I also get the following error:
Traceback (most recent call last):
File "C:\eclipse\plugins\org.python.pydev_3.7.1.201409021729\pysrc\pydevd.py", line 2090, in <module>
debugger.run(setup['file'], None, None)
File "C:\eclipse\plugins\org.python.pydev_3.7.1.201409021729\pysrc\pydevd.py", line 1547, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "D:\BinaryDescriptors3\Python\LearnTripletsOrb\LearnTripletsOrb.py", line 20, in <module>
data = pd.read_csv(data_filepath,delimiter=' ', nrows=1)
File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 443, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 231, in _read
return parser.read(nrows)
File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 686, in read
ret = self._engine.read(nrows)
File "C:\Users\GilLevi\Anaconda\lib\site-packages\pandas\io\parsers.py", line 1130, in read
data = self._reader.read(nrows)
File "parser.pyx", line 727, in pandas.parser.TextReader.read (pandas\parser.c:7146)
File "parser.pyx", line 774, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7707)
StopIteration
I also tried producing a similar file that contains 2 lines of 65K values but uses "," as the delimiter, and got the same error as in 1 and 2.
If read_csv is not the correct approach, can you please recommend a suitable alternative?
The question is old but I hope others might find the answer useful.
Pandas (and to a lesser extent NumPy) is optimized for, and very good at, working with data that has plenty of rows and a limited number of columns (say, a few dozen at most). Your case is the opposite, so it is not the right tool for the task.
I would preprocess the data before loading it into a DataFrame, and I would swap rows and columns in the DataFrame for further processing. So it goes something like this:
txt = open(data_filepath).readlines()  # read the lines first so we know how many columns we need
df = pd.DataFrame(columns=range(len(txt)))
for i, ln in enumerate(txt):
    # each input line of 0/1 values becomes one DataFrame column
    df[i] = ln.split()
...
I believe this is going to be quite fast...
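As a sketch of the same preprocessing idea (the sample file below is a made-up stand-in for your real data; adjust the path and sizes), you can parse each line into a compact uint8 NumPy array and transpose, which avoids asking the CSV parser to manage 65K named columns entirely:

```python
import numpy as np
import pandas as pd

# Build a tiny sample file in the same format: space-separated 0/1
# values, one patch per line (your real file has 1000 lines of 65K values).
with open("sample_patches.txt", "w") as f:
    f.write("0 1 0 1 1\n1 0 0 0 1\n")

with open("sample_patches.txt") as f:
    # Parse each line into a uint8 row; 1000 x 65000 uint8 values take
    # roughly 65 MB, versus ~520 MB if parsed as float64.
    rows = [np.array(ln.split(), dtype=np.uint8) for ln in f]

arr = np.vstack(rows)   # shape: (n_lines, n_values_per_line)

# Transpose so each original line becomes one DataFrame column,
# matching the row/column swap suggested above.
df = pd.DataFrame(arr.T)
print(df.shape)         # (5, 2) for the sample file
```

Since all values are 0 or 1, keeping the array as uint8 (or even packing bits with np.packbits) also keeps memory usage far below what a generic CSV parse would need.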