
Fastest way to read data separated by white space in Python

I have some data separated by white space, from which I want to extract certain columns. In the past I have always used something like the following in Python, with the non-essentials removed:

for line in open(f,'r'):
    l = line.split()
    print " ".join(l[1:3])

I'm wondering, though, whether this is the fastest way to do it. Compared with another software package (written in C) that reads the same data, my code is significantly slower. Is this simply because I/O in C is faster, or am I writing suboptimal code?

You can get the expected columns in a list using a list comprehension.

expectedColumns = [" ".join(x) for x in [line.split()[0:2] for line in file("testFile",'r').readlines()]]

If you want to print the columns inside the list comprehension, you can do this :)

from __future__ import print_function
[print(" ".join(x)) for x in [line.split()[0:2] for line in file("testFile",'r').readlines()]]
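The two snippets above rely on Python 2 (the `file()` builtin no longer exists in Python 3). A minimal Python 3 sketch of the same idea follows; the helper name `select_columns` and the sample data are made up for illustration and are not part of the original answer.

```python
import io


def select_columns(lines, start=0, stop=2):
    """Return fields [start:stop] of each line, re-joined with single spaces."""
    return [" ".join(line.split()[start:stop]) for line in lines]


# Hypothetical in-memory data standing in for "testFile".
sample = io.StringIO("a b c d\n1 2 3 4\n")
for row in select_columns(sample):
    print(row)
```

Passing `start=1, stop=3` would select the same columns as the `l[1:3]` slice in the question.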

When you iterate directly over a file, it is read line by line. This helps with huge files, but at the cost of extra I/O, even though iteration uses a read-ahead buffer. AFAIK, internally it uses seek and tell when iterating.

If you call read(), the entire content is read at once, at the cost of memory. In your case you can use read().split('\n') or readlines() (preferred), and it should be faster than iterating over the file directly.
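Whether one approach really beats another depends on the Python version, the file size, and the platform, so it is worth measuring. The following is only a rough Python 3 timing sketch; the function names, the generated test file, and the repeat count are all assumptions chosen for illustration.

```python
import os
import tempfile
import timeit


def iterate_file(path):
    """Iterate over the file object directly."""
    with open(path) as fh:
        return [line.split()[1:3] for line in fh]


def use_readlines(path):
    """Load all lines into a list first with readlines()."""
    with open(path) as fh:
        return [line.split()[1:3] for line in fh.readlines()]


def use_read_split(path):
    """Read the whole file as one string, then split on newlines."""
    with open(path) as fh:
        # The trailing newline leaves one empty string, which is skipped.
        return [line.split()[1:3] for line in fh.read().split("\n") if line]


def make_sample(n_rows=10000):
    """Write a hypothetical four-column test file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as fh:
        for i in range(n_rows):
            fh.write(f"{i} {i * 2} {i * 3} {i * 4}\n")
    return path


if __name__ == "__main__":
    path = make_sample()
    try:
        for fn in (iterate_file, use_readlines, use_read_split):
            seconds = timeit.timeit(lambda: fn(path), number=20)
            print(f"{fn.__name__}: {seconds:.3f} s")
    finally:
        os.remove(path)
```

Run it against a file that resembles your real data; the small generated sample here may not show the same ranking.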

In addition to the above, please use context managers when dealing with files, so that they are closed once you are done.
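As a small illustration of that advice, here is a Python 3 sketch; the function name `read_columns` and its default slice are hypothetical. The `with` block closes the file when it exits, even if an exception is raised inside it.

```python
def read_columns(path, start=1, stop=3):
    """Return fields [start:stop] of every line in the file at `path`."""
    with open(path) as fh:
        rows = [" ".join(line.split()[start:stop]) for line in fh]
    # Leaving the with-block has already closed the file.
    assert fh.closed
    return rows
```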


You might want to look at the csv module. csv.reader is implemented in C and should be faster than using pure Python.

import csv
with open(f, 'rb') as file:
    r = csv.reader(file, delimiter=' ')
    for line in r:
        print ' '.join(line[1:3])
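One caveat worth checking before switching: with `delimiter=' '`, `csv.reader` treats every single space as a separator, so runs of spaces produce empty fields, whereas `str.split()` collapses any run of whitespace. The Python 3 sketch below shows the difference on made-up input; the helper names are hypothetical.

```python
import csv
import io


def csv_columns(text, start=1, stop=3):
    """Slice each row as parsed by csv.reader with a single-space delimiter."""
    reader = csv.reader(io.StringIO(text), delimiter=" ")
    return [row[start:stop] for row in reader]


def split_columns(text, start=1, stop=3):
    """Slice each row as parsed by str.split(), which collapses whitespace runs."""
    return [line.split()[start:stop] for line in text.splitlines()]


single = "a b c d\n"
double = "a  b c d\n"  # two spaces after "a"

print(csv_columns(single), split_columns(single))  # same result for both
print(csv_columns(double), split_columns(double))  # csv yields an empty field
```

If your columns are padded with several spaces or separated by tabs, the two approaches will therefore not select the same fields.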
