Adapt an iterator to behave like a file-like object in Python

I have a generator producing a sequence of strings. Is there a utility/adapter in Python that could make it look like a file?

For example:

>>> def str_fn():
...     for c in 'a', 'b', 'c':
...         yield c * 3
... 
>>> for s in str_fn():
...     print s
... 
aaa
bbb
ccc
>>> stream = some_magic_adaptor(str_fn())
>>> while True:
...    data = stream.read(4)
...    if not data:
...        break
...    print data
aaab
bbcc
c

Because the data may be big and needs to be streamable (each fragment is a few kilobytes, the entire stream is tens of megabytes), I do not want to eagerly evaluate the whole generator before passing it to the stream adaptor.

The "correct" way to do this is to inherit from a standard Python io abstract base class. However, it doesn't appear that Python allows you to provide a raw text class and wrap it with a buffered reader of any kind.

The best class to inherit from is TextIOBase. Here's such an implementation, handling readline and read while being mindful of performance.

import io

class StringIteratorIO(io.TextIOBase):

    def __init__(self, iter):
        self._iter = iter
        self._left = ''

    def readable(self):
        return True

    def _read1(self, n=None):
        while not self._left:
            try:
                self._left = next(self._iter)
            except StopIteration:
                break
        ret = self._left[:n]
        self._left = self._left[len(ret):]
        return ret

    def read(self, n=None):
        l = []
        if n is None or n < 0:
            while True:
                m = self._read1()
                if not m:
                    break
                l.append(m)
        else:
            while n > 0:
                m = self._read1(n)
                if not m:
                    break
                n -= len(m)
                l.append(m)
        return ''.join(l)

    def readline(self):
        l = []
        while True:
            i = self._left.find('\n')
            if i == -1:
                l.append(self._left)
                try:
                    self._left = next(self._iter)
                except StopIteration:
                    self._left = ''
                    break
            else:
                l.append(self._left[:i+1])
                self._left = self._left[i+1:]
                break
        return ''.join(l)

Here's a solution that should read from your iterator in chunks.

class some_magic_adaptor:
  def __init__( self, it ):
    self.it = it
    self.next_chunk = ""
  def growChunk( self ):
    self.next_chunk = self.next_chunk + next(self.it)
  def read( self, n ):
    if self.next_chunk is None:
      return None
    try:
      while len(self.next_chunk)<n:
        self.growChunk()
      rv = self.next_chunk[:n]
      self.next_chunk = self.next_chunk[n:]
      return rv
    except StopIteration:
      rv = self.next_chunk
      self.next_chunk = None
      return rv


def str_fn():
  for c in 'a', 'b', 'c':
    yield c * 3

ff = some_magic_adaptor( str_fn() )

while True:
  data = ff.read(4)
  if not data:
    break
  print(data)

The problem with StringIO is that you have to load everything into the buffer up front. This can be a problem if the generator is infinite :)

from itertools import chain, islice
class some_magic_adaptor(object):
    def __init__(self, src):
        self.src = chain.from_iterable(src)
    def read(self, n):
        return "".join(islice(self.src, None, n))
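
As a quick check of the laziness claim, here is a minimal sketch that feeds the same chain/islice idea an endless generator. The names LazyCharReader and endless are made up for this illustration; the class body just repeats the adaptor above so that the snippet runs on its own.

```python
from itertools import chain, count, islice


class LazyCharReader(object):
    """Same idea as the adaptor above, repeated so the snippet is standalone."""

    def __init__(self, src):
        # Flatten the iterable of strings into a lazy stream of characters.
        self.src = chain.from_iterable(src)

    def read(self, n):
        # Pull at most n characters; nothing beyond that is consumed.
        return "".join(islice(self.src, n))


def endless():
    # Hypothetical infinite source: "chunk0;", "chunk1;", ...
    for i in count():
        yield "chunk%d;" % i


stream = LazyCharReader(endless())
first = stream.read(10)
second = stream.read(10)
print(first)
print(second)
```

Only as many chunks as each read needs are pulled from the generator. Because chain.from_iterable walks the strings one character at a time, this is likely to be slow for streams of tens of megabytes.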

There is one called werkzeug.contrib.iterio.IterIO, but note that it stores the entire iterator in memory (up to the point you have read it as a file), so it might not be suitable.

http://werkzeug.pocoo.org/docs/contrib/iterio/

Source: https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/contrib/iterio.py

An open bug on readline / iter: https://github.com/mitsuhiko/werkzeug/pull/500

Here's a modified version of John and Matt's answer that can read a list/generator of strings and output bytearrays. As written it targets Python 2: in Python 3, bytearray() cannot be built from an iterable of str characters.

import itertools as it
from io import TextIOBase

class IterStringIO(TextIOBase):
    def __init__(self, iterable=None):
        iterable = iterable or []
        self.iter = it.chain.from_iterable(iterable)

    def not_newline(self, s):
        return s not in {'\n', '\r', '\r\n'}

    def write(self, iterable):
        to_chain = it.chain.from_iterable(iterable)
        self.iter = it.chain.from_iterable([self.iter, to_chain])

    def read(self, n=None):
        return bytearray(it.islice(self.iter, None, n))

    def readline(self, n=None):
        to_read = it.takewhile(self.not_newline, self.iter)
        return bytearray(it.islice(to_read, None, n))

Usage:

ff = IterStringIO(c * 3 for c in ['a', 'b', 'c'])

while True:
    data = ff.read(4)

    if not data:
        break

    print data

aaab
bbcc
c

Alternate usage:

ff = IterStringIO()
ff.write('ddd')
ff.write(c * 3 for c in ['a', 'b', 'c'])

while True:
    data = ff.read(4)

    if not data:
        break

    print data

ddda
aabb
bccc

Looking at Matt's answer, I can see that it's not always necessary to implement all the read methods. read1 may be sufficient, which is described as:

Read and return up to size bytes, with at most one call to the underlying raw stream's read()...

Then it can be wrapped with io.TextIOWrapper which, for instance, has an implementation of readline. As an example, here's streaming of a CSV file from S3 (Amazon Simple Storage Service) via boto.s3.key.Key, which implements an iterator for reading.

import io
import csv

from boto import s3


class StringIteratorIO(io.TextIOBase):

    def __init__(self, iter):
        self._iterator = iter
        self._buffer = ''

    def readable(self):
        return True

    def read1(self, n=None):
        while not self._buffer:
            try:
                self._buffer = next(self._iterator)
            except StopIteration:
                break
        result = self._buffer[:n]
        self._buffer = self._buffer[len(result):]
        return result


conn = s3.connect_to_region('some_aws_region')
bucket = conn.get_bucket('some_bucket')
key = bucket.get_key('some.csv')    

fp = io.TextIOWrapper(StringIteratorIO(key))
reader = csv.DictReader(fp, delimiter = ';')
for row in reader:
    print(row)

Update

Here's an answer to a related question which looks a little better. It inherits io.RawIOBase and overrides readinto. In Python 3 that is sufficient, so instead of wrapping IterStream in io.BufferedReader one can wrap it in io.TextIOWrapper. In Python 2 read1 is needed, but it can be simply expressed through readinto.
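
The readinto approach can be sketched roughly as follows for Python 3. This is my own reconstruction of the idea, not the linked answer's code: the class body, the sample chunks and the choice to go through io.BufferedReader are all assumptions made for illustration, and the source iterator is assumed to yield bytes.

```python
import io


class IterStream(io.RawIOBase):
    """Raw binary stream over an iterator of bytes chunks (illustrative)."""

    def __init__(self, iterable):
        super().__init__()
        self._it = iter(iterable)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Pull chunks until there is data to hand out; empty chunks are skipped.
        while not self._leftover:
            try:
                self._leftover = next(self._it)
            except StopIteration:
                return 0  # zero bytes read signals end of stream
        # Copy as much as fits into the caller's buffer and keep the rest.
        n = min(len(b), len(self._leftover))
        b[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


# Hypothetical input: CSV data arriving in arbitrary byte fragments.
chunks = [b"a;b\n1;", b"2\n3;4\n"]
text = io.TextIOWrapper(io.BufferedReader(IterStream(chunks)), encoding="utf-8")
lines = list(text)
print(lines)
```

Only readable and readinto are written by hand; buffering, decoding and line splitting come from the standard io wrappers, so the resulting object can be handed to consumers such as csv.DictReader.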

This is exactly what StringIO is for.

>>> import StringIO
>>> some_var = StringIO.StringIO("Hello World!")
>>> some_var.read(4)
'Hell'
>>> some_var.read(4)
'o Wo'
>>> some_var.read(4)
'rld!'
>>>

Or, if you want to do what it sounds like:

class MyString(StringIO.StringIO):
    def __init__(self, *args):
        StringIO.StringIO.__init__(self, "".join(args))

Then you can simply do:

xx = MyString(*list_of_strings)
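
For Python 3, where the StringIO module no longer exists, the same eager approach can be sketched with io.StringIO. The str_fn generator is copied from the question. Note the assumption built into this approach: join() consumes the whole generator before the first read, which is exactly what the question wants to avoid for large streams.

```python
import io


def str_fn():
    for c in "a", "b", "c":
        yield c * 3


# Eager: the whole generator is consumed by join() before any read happens.
stream = io.StringIO("".join(str_fn()))

pieces = []
while True:
    data = stream.read(4)
    if not data:
        break
    pieces.append(data)

print(pieces)
```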

If you only need a read method, then this can be enough:

def to_file_like_obj(iterable, base):
    chunk = base()
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset

        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj:
        def read(self, size=-1):
            return base().join(up_to_iter(float('inf') if size is None or size < 0 else size))

    return FileLikeObj()

which can be used for an iterable yielding str:

my_file = to_file_like_obj(str_fn(), str)

or, if you have an iterable yielding bytes rather than str, and you want a file-like object whose read method returns bytes:

my_file = to_file_like_obj(bytes_fn(), bytes)

This pattern has a few nice properties, I think:

  • Not much code, and it can be used for both str and bytes
  • Returns exactly the length that was asked for, whether the iterable yields small chunks or big chunks (other than at the end of the iterable)
  • Does not append str / bytes, so avoids copying
  • Leverages slicing, which also avoids copying, because a slice of a str / bytes that spans the entire instance returns that same instance
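
The last point can be checked with a short sketch. It relies on an implementation detail of CPython (a full-range slice of an exact str or bytes object returns the object itself) rather than on a language guarantee, so other interpreters may behave differently. The variable names are made up for the illustration.

```python
chunk = "x" * 4096            # stand-in for one fragment from the iterable
whole = chunk[0:len(chunk)]   # slice covering the entire string
part = chunk[0:10]            # proper sub-slice

same_object = whole is chunk      # no copy made for the full-range slice
part_is_copy = part is not chunk  # a shorter slice is a new object

data = bytes(4096)
bytes_same_object = data[0:len(data)] is data

print(same_object, part_is_copy, bytes_same_object)
```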

First of all, your generator will have to yield byte objects. While there isn't anything built-in, you can use a combination of http://docs.python.org/library/stringio.html and itertools.chain.
