
Convert txt with 300 million rows to numpy array

I have a txt file that contains more than 300 million rows in a single column (integer values). I'm trying to read it and convert it to a numpy array. So far I have tried

import numpy as np
label = np.loadtxt('/path/to/file')

and

import fileinput

for line in fileinput.input('/path/to/file'):
    do_something_with(line)

np.loadtxt seems slightly faster, but it still needs around 2 hours to process a single txt file. At more than 300 million rows of one column, the file is still only around 950 MB, so I suspect np.loadtxt is also reading the file line by line, which is what makes the processing so slow. Is there any method that can speed up this reading and converting process while keeping the rows in their original sequence?

Thanks a lot for your support and help.

Sounds like the file is simple enough that readlines may work.

Make a small sample file:

In [2]: arr = np.random.random((100,1))
In [4]: np.savetxt('test.txt', arr, fmt='%f')
In [6]: !head test.txt
0.872225
0.365394
0.802365
0.140455
0.041390
0.531483
0.415459
0.906439
0.789604
0.493369

The straightforward loadtxt:

In [8]: arr1 = np.loadtxt('test.txt')
In [9]: arr1.shape
Out[9]: (100,)

np.loadtxt has an ndmin parameter that can force a (100, 1) shape; I'll leave that aside for now.
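A minimal sketch of what that looks like (ndmin pads the result to at least that many dimensions):

import numpy as np

# ndmin=2 keeps the trailing singleton dimension: shape (100, 1) instead of (100,)
arr1_2d = np.loadtxt('test.txt', ndmin=2)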

Let's try readlines, using np.array to convert the list of strings to a float array:

In [11]: arr2 = np.array(open('test.txt').readlines(), dtype=float)
In [12]: arr2.shape
Out[12]: (100,)
In [13]: np.allclose(arr1,arr2)
Out[13]: True

Compare times:

In [14]: timeit arr2 = np.array(open('test.txt').readlines(), dtype=float)
77.5 µs ± 961 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [15]: timeit arr1 = np.loadtxt('test.txt')
605 µs ± 1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A lot faster with readlines.
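One caveat: readlines materializes every line as a Python string, which for 300 million rows means a lot of objects in memory at once. If that's a concern, the same idea works chunk by chunk while preserving row order. A sketch, where load_column and hint_bytes are illustrative names, not a library API:

import numpy as np

def load_column(path, hint_bytes=64 * 1024 * 1024):
    """Parse a one-column text file into a float array, chunk by chunk."""
    chunks = []
    with open(path) as f:
        while True:
            # readlines(hint) stops after roughly hint_bytes worth of lines
            lines = f.readlines(hint_bytes)
            if not lines:
                break
            chunks.append(np.array(lines, dtype=float))
    return np.concatenate(chunks)  # rows stay in their original order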

Another approach is fromfile:

In [18]: arr3 = np.fromfile('test.txt',dtype=float, sep=' ')
In [19]: arr3.shape
Out[19]: (100,)
In [20]: np.allclose(arr1,arr3)
Out[20]: True
In [21]: timeit arr3 = np.fromfile('test.txt',dtype=float, sep=' ')
118 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Not quite as fast, but still better than loadtxt. And yes, loadtxt does read the file line by line.

genfromtxt is a bit better than loadtxt.
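For completeness, the equivalent genfromtxt call (a sketch, not timed here):

import numpy as np

# same one-column parse, via the more flexible genfromtxt
arr5 = np.genfromtxt('test.txt')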

pandas is supposed to have a fast csv reader, but that doesn't seem to be the case here:

In [33]: timeit arr4=pd.read_csv('test.txt',header=None).to_numpy()
1.12 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
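One thing worth checking on a file this size: by default read_csv infers the dtype. A variant that passes it explicitly (dtype is a real read_csv parameter; whether it helps on this file is untested):

import numpy as np
import pandas as pd

# Explicit dtype skips type inference; ravel() flattens the (n, 1) frame to (n,)
arr4 = pd.read_csv('test.txt', header=None, dtype=np.float64).to_numpy().ravel()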
