
The fastest way to read input in Python

I want to read a huge text file that contains a list of lists of integers. Currently I'm doing the following:

G = []
with open("test.txt", 'r') as f:
    for line in f:
        G.append(list(map(int,line.split())))

However, it takes about 17 seconds (measured via timeit). Is there any way to reduce this time? Maybe there is a way not to use map.

numpy has the functions loadtxt and genfromtxt, but neither is particularly fast. One of the fastest text readers available in a widely distributed library is the read_csv function in pandas (http://pandas.pydata.org/). On my computer, reading 5 million lines containing two integers per line takes about 46 seconds with numpy.loadtxt, 26 seconds with numpy.genfromtxt, and a little over 1 second with pandas.read_csv.

Here's the session showing the result. (This is on Linux, Ubuntu 12.04 64-bit. You can't see it here, but after each read of the file, the disk cache was cleared by running sync; echo 3 > /proc/sys/vm/drop_caches in a separate shell. Note that loadtxt and genfromtxt are called unqualified below, so the session was presumably started with numpy's namespace already imported, e.g. via ipython --pylab.)

In [1]: import pandas as pd

In [2]: %timeit -n1 -r1 loadtxt('junk.dat')
1 loops, best of 1: 46.4 s per loop

In [3]: %timeit -n1 -r1 genfromtxt('junk.dat')
1 loops, best of 1: 26 s per loop

In [4]: %timeit -n1 -r1 pd.read_csv('junk.dat', sep=' ', header=None)
1 loops, best of 1: 1.12 s per loop

pandas, which is based on numpy, has a C-based file parser which is very fast:

# imports assumed earlier in the session
In [22]: import numpy as np

In [23]: import pandas as pd

# generate some integer data (5 M rows, two cols) and write it to file
In [24]: data = np.random.randint(1000, size=(5 * 10**6, 2))

In [25]: np.savetxt('testfile.txt', data, delimiter=' ', fmt='%d')

# your way
In [26]: def your_way(filename):
   ...:     G = []
   ...:     with open(filename, 'r') as f:
   ...:         for line in f:
   ...:             G.append(list(map(int, line.split())))
   ...:     return G        
   ...: 

In [27]: %timeit your_way('testfile.txt')
1 loops, best of 3: 16.2 s per loop

In [28]: %timeit pd.read_csv('testfile.txt', delimiter=' ', dtype=int, header=None)
1 loops, best of 3: 1.57 s per loop

So pandas.read_csv takes about one and a half seconds to read your data and is about 10 times faster than your method.
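If you need the same list-of-lists structure that the original code builds, the DataFrame can be converted afterwards. A minimal sketch, assuming the whitespace-delimited file generated above:

import pandas as pd

# Parse the file with pandas' fast C parser, then convert the
# resulting DataFrame into a plain Python list of lists.
df = pd.read_csv('testfile.txt', delimiter=' ', header=None, dtype=int)
G = df.values.tolist()

Note that the conversion to Python lists adds its own overhead for very large arrays, but parsing still dominates the total time.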

As a general rule of thumb (for just about any language), using read() to read in the entire file is going to be quicker than reading one line at a time. If you're not constrained by memory, read the whole file at once, split the data on newlines, and then iterate over the list of lines.
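A minimal sketch of that approach, assuming the same whitespace-delimited layout as the question:

with open('test.txt', 'r') as f:
    data = f.read()

# Split the whole buffer on newlines once, then parse each line.
G = [[int(x) for x in line.split()] for line in data.splitlines()]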

The easiest speedup would be to go for PyPy: http://pypy.org/

The next step is to not read the file at all (if possible); instead, process it like a stream.
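For instance, a generator keeps memory usage flat by yielding one parsed row at a time instead of materializing the whole list. A sketch (the sum at the end is just a hypothetical consumer):

def rows(filename):
    # Yield one parsed row at a time rather than building a big list.
    with open(filename) as f:
        for line in f:
            yield [int(x) for x in line.split()]

# Example consumer: aggregate without ever holding all rows in memory.
total = sum(sum(row) for row in rows('test.txt'))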

List comprehensions are often faster:

with open("test.txt") as f:
    G = [[int(item) for item in line.split()] for line in f]

Beyond that, try PyPy, Cython, and numpy.

You might also try to bring the data into a database via bulk insert, then process your records with set operations. Depending on what you have to do, that may be faster, as bulk-insert code paths are optimized for this type of task.
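A minimal sketch of the bulk-insert idea using the standard-library sqlite3 module, assuming two integers per line; the file, database, and table names here are hypothetical:

import sqlite3

conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE IF NOT EXISTS pairs (a INTEGER, b INTEGER)')

with open('test.txt') as f:
    # executemany consumes the generator lazily and performs the
    # inserts in bulk inside a single transaction.
    rows = (tuple(map(int, line.split())) for line in f)
    conn.executemany('INSERT INTO pairs VALUES (?, ?)', rows)

conn.commit()
conn.close()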
