Read CSV over FTP line by line without storing the whole file in memory/disk

I'm stuck trying to pipe ftplib.FTP.retrlines into csv.reader.

FTP.retrlines repeatedly calls a callback, passing it one line each time, while csv.reader expects an iterator that returns a string each time its __next__() method is called.

How do I combine the two so that I can read and process the file without first reading the whole file and storing it in, e.g., an io.TextIOWrapper?

My problem is that FTP.retrlines won't return until it has consumed the whole file.

I'm not sure whether there is a better solution, but you can glue FTP.retrlines and csv.reader together using an iterable queue-like object. As both functions are synchronous, you have to run them on different threads in parallel.

Something like this:

from queue import Queue
from ftplib import FTP
from threading import Thread
import csv
 
ftp = FTP(host)
ftp.login(username, password)

class LineQueue:
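    def __init__(self, maxsize=10):
        # Per-instance bounded queue; when it is full, the download thread waits
        self._queue = Queue(maxsize)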

    def add(self, s):
        print(f"Queueing line {s}")
        self._queue.put(s)
        print(f"Queued line {s}")

    def done(self):
        print("Signaling Done")
        self._queue.put(False)
        print("Signaled Done")

    def __iter__(self):
        print("Reading lines")
        while True:
            print("Reading line")
            s = self._queue.get()
            if s is False:
                print("Read all lines")
                break

            print(f"Read line {s}")
            yield s

q = LineQueue()

def download():
    ftp.retrlines("RETR /path/data.csv", q.add)
    q.done()

thread = Thread(target=download)
thread.start()

print("Reading CSV")
for entry in csv.reader(q):
    print(entry)

print("Read CSV")

thread.join()
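
One gap worth noting in the approach above: if the download fails partway, the end-of-stream marker is never queued and the reading loop waits forever. A minimal sketch of one way to handle that follows. The helper name `iter_from_callback` and the `_DONE` sentinel are made up for illustration (they are not part of ftplib or csv), and the only assumption about the producer is that it calls its callback once per line, as FTP.retrlines does.

```python
import csv
import queue
import threading

_DONE = object()  # hypothetical end-of-stream marker, never equal to a real line


def iter_from_callback(producer, maxsize=10):
    """Yield every item that producer(callback) passes to its callback.

    The producer runs on a background thread; an exception raised there
    is re-raised in the consuming thread instead of causing a hang.
    """
    q = queue.Queue(maxsize)

    def run():
        try:
            producer(q.put)      # blocks whenever the queue is full
        except Exception as exc:
            q.put(exc)           # hand the failure over to the consumer
        else:
            q.put(_DONE)

    # Daemon thread: a consumer that stops early does not block interpreter exit
    threading.Thread(target=run, daemon=True).start()

    while True:
        item = q.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item


# Usage with an already logged-in ftplib.FTP object named ftp (not run here):
# rows = csv.reader(
#     iter_from_callback(lambda cb: ftp.retrlines("RETR /path/data.csv", cb))
# )
```

If the consumer stops iterating early, the producer thread stays blocked on the full queue; the daemon flag only keeps that from holding up process exit, it does not close the FTP transfer.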

Same solution as Martin's, just saving a few lines of code by subclassing queue.Queue directly.

from queue import Queue
from ftplib import FTP
from threading import Thread
import csv
 
ftp = FTP(**ftp_credentials)

class LineQueue(Queue):
    def __iter__(self):
        while True:
            s = self.get()
            if s is None:
                break
            yield s

    def __call__(self):
        ftp.retrlines(f"RETR {fname}", self.put)
        self.put(None)

q = LineQueue(10)

thread = Thread(target=q)
thread.start()

for entry in csv.reader(q):
    print(entry)

thread.join()
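
A different technique from the two answers above, for cases where the extra thread is unwanted: ftplib also exposes FTP.transfercmd, which returns the data-connection socket, and that socket can be wrapped in a file object and handed to csv.reader directly. The sketch below is an assumption-laden illustration, not a tested recipe for every server: the function name `iter_csv_rows` is made up, it assumes a plain (non-TLS) connection that is already logged in, and it assumes the caller reads the file to the end.

```python
import csv


def iter_csv_rows(ftp, cmd, **reader_kwargs):
    """Yield CSV rows from a text-mode transfer on a logged-in ftplib.FTP.

    ftp is an ftplib.FTP instance; cmd is e.g. "RETR /path/data.csv".
    """
    ftp.sendcmd("TYPE A")  # the same transfer type that retrlines() selects
    with ftp.transfercmd(cmd) as conn, \
            conn.makefile("r", encoding=ftp.encoding, newline="") as fp:
        # newline="" leaves line endings to the csv module, as its docs advise
        yield from csv.reader(fp, **reader_kwargs)
    # Reached only after the data connection was read to the end:
    # consume the server's final reply on the control connection.
    ftp.voidresp()


# Usage with an already logged-in ftplib.FTP object named ftp (not run here):
# for entry in iter_csv_rows(ftp, "RETR /path/data.csv"):
#     print(entry)
```

Because the rows are pulled straight from the socket, only one buffered chunk is in memory at a time. If the loop is abandoned before the end of the file, the final reply is never read, so the control connection should probably not be reused afterwards without further cleanup.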
