Adapt an iterator to behave like a file-like object in Python
I have a generator producing a sequence of strings. Is there a utility/adapter in Python that could make it look like a file?
For example,
>>> def str_fn():
...     for c in 'a', 'b', 'c':
...         yield c * 3
...
>>> for s in str_fn():
...     print(s)
...
aaa
bbb
ccc
>>> stream = some_magic_adaptor(str_fn())
>>> while True:
...     data = stream.read(4)
...     if not data:
...         break
...     print(data)
aaab
bbcc
c
Because the data may be big and needs to be streamable (each fragment is a few kilobytes, the entire stream is tens of megabytes), I do not want to eagerly evaluate the whole generator before passing it to the stream adaptor.
The "correct" way to do this is to inherit from a standard Python io abstract base class. However, it doesn't appear that Python allows you to provide a raw text class and wrap it with a buffered reader of any kind.
The best class to inherit from is TextIOBase. Here's such an implementation, handling readline and read while being mindful of performance.
import io

class StringIteratorIO(io.TextIOBase):

    def __init__(self, iter):
        self._iter = iter
        self._left = ''

    def readable(self):
        return True

    def _read1(self, n=None):
        while not self._left:
            try:
                self._left = next(self._iter)
            except StopIteration:
                break
        ret = self._left[:n]
        self._left = self._left[len(ret):]
        return ret

    def read(self, n=None):
        l = []
        if n is None or n < 0:
            while True:
                m = self._read1()
                if not m:
                    break
                l.append(m)
        else:
            while n > 0:
                m = self._read1(n)
                if not m:
                    break
                n -= len(m)
                l.append(m)
        return ''.join(l)

    def readline(self):
        l = []
        while True:
            i = self._left.find('\n')
            if i == -1:
                l.append(self._left)
                try:
                    self._left = next(self._iter)
                except StopIteration:
                    self._left = ''
                    break
            else:
                l.append(self._left[:i+1])
                self._left = self._left[i+1:]
                break
        return ''.join(l)
Here's a solution that should read from your iterator in chunks.
class some_magic_adaptor:
    def __init__(self, it):
        self.it = it
        self.next_chunk = ""

    def growChunk(self):
        # next(it) works on both Python 2 and 3, unlike it.next()
        self.next_chunk = self.next_chunk + next(self.it)

    def read(self, n):
        # next_chunk is None once the iterator has been exhausted
        if self.next_chunk is None:
            return None
        try:
            while len(self.next_chunk) < n:
                self.growChunk()
            rv = self.next_chunk[:n]
            self.next_chunk = self.next_chunk[n:]
            return rv
        except StopIteration:
            # return whatever is left and mark the stream as finished
            rv = self.next_chunk
            self.next_chunk = None
            return rv

def str_fn():
    for c in 'a', 'b', 'c':
        yield c * 3

ff = some_magic_adaptor(str_fn())
while True:
    data = ff.read(4)
    if not data:
        break
    print(data)
The problem with StringIO is that you have to load everything into the buffer up front. This can be a problem if the generator is infinite :)
from itertools import chain, islice

class some_magic_adaptor(object):
    def __init__(self, src):
        self.src = chain.from_iterable(src)

    def read(self, n):
        return "".join(islice(self.src, None, n))
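To see why the laziness matters, here is a small sketch of the same chain/islice mechanism applied to an endless generator. The endless() helper and its chunk format are made up for illustration; the point is that only as many characters as requested are ever pulled from the source.

```python
from itertools import chain, count, islice

def endless():
    # Hypothetical infinite source: "chunk0;", "chunk1;", ...
    for i in count():
        yield "chunk%d;" % i

# Flatten the stream of strings into a lazy stream of characters.
chars = chain.from_iterable(endless())

# Each islice call consumes only the next 10 characters.
first = "".join(islice(chars, 10))
second = "".join(islice(chars, 10))
print(first)
print(second)
```

Note that this walks the data one character at a time, so it trades speed for simplicity on large streams.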
There is one called werkzeug.contrib.iterio.IterIO, but note that it stores the entire iterator in its memory (up to the point you have read it as a file), so it might not be suitable.
http://werkzeug.pocoo.org/docs/contrib/iterio/
Source: https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/contrib/iterio.py
An open bug on readline/iter: https://github.com/mitsuhiko/werkzeug/pull/500
Here's a modified version of John and Matt's answer that can read a list/generator of strings and output bytearrays:
import itertools as it
from io import TextIOBase

class IterStringIO(TextIOBase):
    def __init__(self, iterable=None):
        iterable = iterable or []
        self.iter = it.chain.from_iterable(iterable)

    def not_newline(self, s):
        return s not in {'\n', '\r', '\r\n'}

    def write(self, iterable):
        to_chain = it.chain.from_iterable(iterable)
        self.iter = it.chain.from_iterable([self.iter, to_chain])

    def read(self, n=None):
        # Written for Python 2; on Python 3, bytearray() does not
        # accept an iterable of str.
        return bytearray(it.islice(self.iter, None, n))

    def readline(self, n=None):
        to_read = it.takewhile(self.not_newline, self.iter)
        return bytearray(it.islice(to_read, None, n))
Usage:
ff = IterStringIO(c * 3 for c in ['a', 'b', 'c'])
while True:
    data = ff.read(4)
    if not data:
        break
    print(data)
aaab
bbcc
c
Alternate usage:
ff = IterStringIO()
ff.write('ddd')
ff.write(c * 3 for c in ['a', 'b', 'c'])
while True:
    data = ff.read(4)
    if not data:
        break
    print(data)
ddda
aabb
bccc
Looking at Matt's answer, I can see that it's not always necessary to implement all the read methods. read1 may be sufficient, which is described as:
Read and return up to size bytes, with at most one call to the underlying raw stream's read()...
Then it can be wrapped with io.TextIOWrapper which, for instance, has an implementation of readline. As an example, here's streaming of a CSV file from S3's (Amazon Simple Storage Service) boto.s3.key.Key, which implements an iterator for reading.
import io
import csv

from boto import s3

class StringIteratorIO(io.TextIOBase):

    def __init__(self, iter):
        self._iterator = iter
        self._buffer = ''

    def readable(self):
        return True

    def read1(self, n=None):
        while not self._buffer:
            try:
                self._buffer = next(self._iterator)
            except StopIteration:
                break
        result = self._buffer[:n]
        self._buffer = self._buffer[len(result):]
        return result

conn = s3.connect_to_region('some_aws_region')
bucket = conn.get_bucket('some_bucket')
key = bucket.get_key('some.csv')

fp = io.TextIOWrapper(StringIteratorIO(key))
reader = csv.DictReader(fp, delimiter=';')
for row in reader:
    print(row)
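One caveat if you try this on Python 3: io.TextIOWrapper decodes bytes, so the object it wraps has to hand back bytes from read1, not str. Below is a rough sketch of the same read1-only idea for that case. BytesIteratorIO and the in-memory chunks list are hypothetical stand-ins (the list plays the role of the S3 key iterator), and the sketch assumes the iterable yields bytes.

```python
import csv
import io

class BytesIteratorIO(io.BufferedIOBase):
    """Hypothetical bytes variant: implements only readable() and read1()."""

    def __init__(self, iterable):
        self._iterator = iter(iterable)
        self._buffer = b""

    def readable(self):
        return True

    def read1(self, n=-1):
        # Pull the next non-empty chunk only when the buffer is exhausted.
        while not self._buffer:
            try:
                self._buffer = next(self._iterator)
            except StopIteration:
                break
        if n is None or n < 0:
            n = len(self._buffer)
        result = self._buffer[:n]
        self._buffer = self._buffer[len(result):]
        return result

# Stand-in for a remote source; note the rows are split across chunks.
chunks = [b"name;qty\nfoo;", b"1\nbar;2\n"]

fp = io.TextIOWrapper(BytesIteratorIO(chunks), encoding="utf-8", newline="")
rows = list(csv.DictReader(fp, delimiter=";"))
for row in rows:
    print(row)
```

Only line-by-line iteration is exercised here; calling fp.read() with no size would need a real read() on the wrapped object as well.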
Here's an answer to a related question which looks a little better. It inherits io.RawIOBase and overrides readinto. In Python 3 that's sufficient, so instead of wrapping IterStream in io.BufferedReader one can wrap it in io.TextIOWrapper. In Python 2 read1 is needed, but it can be simply expressed through readinto.
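For reference, the readinto approach might look roughly like this on Python 3. This is my own minimal sketch and not necessarily the code from the linked answer; it assumes the iterable yields bytes, and byte_chunks() is just the question's generator rewritten to yield bytes.

```python
import io

class IterStream(io.RawIOBase):
    """Raw stream over an iterable of bytes chunks (sketch)."""

    def __init__(self, iterable):
        self._iterator = iter(iterable)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Refill from the iterator, skipping empty chunks.
        while not self._leftover:
            try:
                self._leftover = next(self._iterator)
            except StopIteration:
                return 0  # 0 bytes read signals end of stream
        n = min(len(b), len(self._leftover))
        b[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n

def byte_chunks():
    for c in (b"a", b"b", b"c"):
        yield c * 3

# BufferedReader keeps calling readinto until it has the 4 bytes asked for.
stream = io.BufferedReader(IterStream(byte_chunks()))
pieces = []
while True:
    data = stream.read(4)
    if not data:
        break
    pieces.append(data)
print(pieces)
```

Wrapping the same IterStream in io.TextIOWrapper (with an explicit encoding) gives str instead of bytes, plus readline and line iteration for free.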
This is exactly what StringIO is for.
>>> import StringIO
>>> some_var = StringIO.StringIO("Hello World!")
>>> some_var.read(4)
'Hell'
>>> some_var.read(4)
'o Wo'
>>> some_var.read(4)
'rld!'
>>>
Or, if you want to do what it sounds like:
class MyString(StringIO.StringIO):
    def __init__(self, *args):
        StringIO.StringIO.__init__(self, "".join(args))
then you can simply do
xx = MyString(*list_of_strings)
If you only need a read method, then this can be enough:
def to_file_like_obj(iterable, base):
    chunk = base()
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset

        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj:
        def read(self, size=-1):
            return base().join(up_to_iter(float('inf') if size is None or size < 0 else size))

    return FileLikeObj()
which can be used for an iterable yielding str
my_file = to_file_like_obj(str_fn(), str)
or, if you have an iterable yielding bytes rather than str and you want a file-like object whose read method returns bytes,
my_file = to_file_like_obj(bytes_fn(), bytes)
This pattern has a few nice properties, I think:
- It can be used for both str and bytes.
- It does not append str/bytes, so it avoids copying.
- A slice of a str/bytes that should be the entire instance will return exactly that same instance.
First of all, your generator will have to yield byte objects. While there isn't anything built-in, you can use a combination of http://docs.python.org/library/stringio.html and itertools.chain.