
How to read a CSV file from a stream and process each line as it is written?

I would like to read a CSV file from the standard input and process each row as it comes. My CSV-outputting code writes rows one by one, but my reader waits for the stream to be terminated before iterating over the rows. Is this a limitation of the csv module? Am I doing something wrong?

My reader code:

import csv
import sys
import time


reader = csv.reader(sys.stdin)
for row in reader:
    print "Read: (%s) %r" % (time.time(), row)

My writer code:

import csv
import sys
import time


writer = csv.writer(sys.stdout)
for i in range(8):
    writer.writerow(["R%d" % i, "$" * (i+1)])
    sys.stdout.flush()
    time.sleep(0.5)

Output of python test_writer.py | python test_reader.py:

Read: (1309597426.3) ['R0', '$']
Read: (1309597426.3) ['R1', '$$']
Read: (1309597426.3) ['R2', '$$$']
Read: (1309597426.3) ['R3', '$$$$']
Read: (1309597426.3) ['R4', '$$$$$']
Read: (1309597426.3) ['R5', '$$$$$$']
Read: (1309597426.3) ['R6', '$$$$$$$']
Read: (1309597426.3) ['R7', '$$$$$$$$']

As you can see, all print statements are executed at the same time, but I expect there to be a 500 ms gap between them.

As it says in the documentation,

In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer.

And you can see by looking at the implementation of the csv module (line 784) that csv.reader calls the next() method of the underlying iterator (via PyIter_Next).
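To see what that means in practice, here is a small illustrative sketch (not from the original answer): csv.reader accepts any iterator of strings and pulls exactly one item from it per row, so the buffering behaviour of whatever object you pass in determines when rows become available.

import csv

def lines():
    # stand-in for a line source; csv.reader asks this iterator
    # for one item each time it needs another row
    yield "R0,$\n"
    yield "R1,$$\n"

for row in csv.reader(lines()):
    print(row)   # ['R0', '$'] then ['R1', '$$']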

So if you really want unbuffered reading of CSV files, you need to convert the file object (here sys.stdin) into an iterator whose next() method actually calls readline() instead. This can easily be done using the two-argument form of the iter function. So change the code in test_reader.py to something like this:

for row in csv.reader(iter(sys.stdin.readline, '')):
    print("Read: ({}) {!r}".format(time.time(), row))

For example,

$ python test_writer.py | python test_reader.py
Read: (1388776652.964925) ['R0', '$']
Read: (1388776653.466134) ['R1', '$$']
Read: (1388776653.967327) ['R2', '$$$']
Read: (1388776654.468532) ['R3', '$$$$']
[etc]
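As a side note, the two-argument form of iter() used in the fix is a standard Python builtin: iter(callable, sentinel) calls the callable with no arguments on each iteration and stops as soon as it returns the sentinel. A minimal sketch of how it behaves, using io.StringIO as a stand-in for sys.stdin (not part of the original answer):

import io

buf = io.StringIO(u"R0,$\nR1,$$\n")    # stand-in for sys.stdin
for line in iter(buf.readline, u""):   # readline() returns '' at EOF, which ends the loop
    print(repr(line))                  # 'R0,$\n' then 'R1,$$\n'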

Can you explain why you need unbuffered reading of CSV files? There might be a better solution to whatever it is you are trying to do.

Maybe it's a limitation. Read this: http://docs.python.org/using/cmdline.html#cmdoption-unittest-discover-u

Note that there is internal buffering in file.readlines() and File Objects (for line in sys.stdin) which is not influenced by this option. To work around this, you will want to use file.readline() inside a while 1: loop.

I modified test_reader.py as follows:

import csv, sys, time

while True:
    line = sys.stdin.readline()
    if not line:  # readline() returns '' at EOF, so stop
        break
    print "Read: (%s) %r" % (time.time(), line)

Output

python test_writer.py | python  test_reader.py
Read: (1309600865.84) 'R0,$\r\n'
Read: (1309600865.84) 'R1,$$\r\n'
Read: (1309600866.34) 'R2,$$$\r\n'
Read: (1309600866.84) 'R3,$$$$\r\n'
Read: (1309600867.34) 'R4,$$$$$\r\n'
Read: (1309600867.84) 'R5,$$$$$$\r\n'
Read: (1309600868.34) 'R6,$$$$$$$\r\n'
Read: (1309600868.84) 'R7,$$$$$$$$\r\n'

You are flushing stdout, but not stdin.

sys.stdin also has a flush() method; try using that after each line read if you really want to disable the buffering.
