How to use yaml.load_all with fileinput.input?
Without resorting to ''.join, is there a Pythonic way to use PyYAML's yaml.load_all with fileinput.input() for easy streaming of multiple documents from multiple sources?
I'm looking for something like the following (non-working example):
# example.py
import fileinput
import yaml
for doc in yaml.load_all(fileinput.input()):
    print(doc)
Expected output:
$ cat >pre.yaml <<<'--- prefix-doc'
$ cat >post.yaml <<<'--- postfix-doc'
$ python example.py pre.yaml - post.yaml <<<'--- hello'
prefix-doc
hello
postfix-doc
Of course, yaml.load_all expects either a string, bytes, or a file-like object, and fileinput.input() is none of those things, so the above example does not work.
Actual output:
$ python example.py pre.yaml - post.yaml <<<'--- hello'
...
AttributeError: FileInput instance has no attribute 'read'
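The missing method can be confirmed without PyYAML by checking what a FileInput object actually exposes. This is a minimal sketch, assuming Python 3; the temporary file exists only so that nothing is read from stdin:

```python
import fileinput
import os
import tempfile

# Write a throwaway YAML file so the example does not touch stdin.
with tempfile.NamedTemporaryFile('w', suffix='.yaml', delete=False) as tmp:
    tmp.write('--- prefix-doc\n')

fi = fileinput.FileInput(files=[tmp.name])
has_read = hasattr(fi, 'read')          # what yaml.load_all looks for
has_readline = hasattr(fi, 'readline')  # what FileInput offers instead
lines = list(fi)                        # iteration yields whole lines
fi.close()
os.unlink(tmp.name)

print(has_read, has_readline, lines)
```

So the object is an iterator of lines with a readline method, but it has no read(size).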
You can make the example work with ''.join, but that's cheating. I'm looking for a way that does not read the entire stream into memory at once.
We might rephrase the question as: is there some way to emulate a string, bytes, or file-like object that proxies to an underlying iterator of strings? However, I doubt that yaml.load_all actually needs the entire file-like interface, so that phrasing would ask for more than is strictly necessary.
Ideally I'm looking for the minimal adapter that would support something like this:
for doc in yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)
The problem with fileinput.input is that the resulting object doesn't have a read method, which is what yaml.load_all is looking for. If you're willing to give up fileinput, you can just write your own class that will do what you want:
import sys
import yaml


class BunchOFiles(object):
    def __init__(self, *files):
        self.files = files
        self.fditer = self._fditer()
        # next(it, None) works on Python 2.6+ and 3, and copes with no files
        self.fd = next(self.fditer, None)

    def _fditer(self):
        for fn in self.files:
            with sys.stdin if fn == '-' else open(fn, 'r') as fd:
                yield fd

    def read(self, size=-1):
        data = ''
        # self.fd is None once every file has been exhausted
        while self.fd is not None:
            data = self.fd.read(size)
            if data:
                break
            # current file is exhausted, move on to the next one
            self.fd = next(self.fditer, None)
        return data


bunch = BunchOFiles(*sys.argv[1:])
# PyYAML 6 requires an explicit Loader argument for load_all
for doc in yaml.load_all(bunch, Loader=yaml.SafeLoader):
    print(doc)
The BunchOFiles class gets you an object with a read method that will happily iterate over a list of files until everything is exhausted. Given the above code and your sample input, we get exactly the output you're looking for.
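To see how such a read method behaves at file boundaries, here is a small sketch that uses in-memory io.StringIO objects in place of real files. ChainedReader is a hypothetical name, not part of PyYAML or of the class above, and it takes already opened streams rather than file names:

```python
import io


class ChainedReader:
    """Expose several text streams as one object with a read() method."""

    def __init__(self, *streams):
        self._streams = iter(streams)
        self._current = next(self._streams, None)

    def read(self, size=-1):
        while self._current is not None:
            data = self._current.read(size)
            if data:
                return data
            # Current stream is exhausted: move on to the next one.
            self._current = next(self._streams, None)
        return ''  # every stream is exhausted


reader = ChainedReader(io.StringIO('--- a\n'), io.StringIO('--- b\n'))
chunks = []
while True:
    chunk = reader.read(4)
    if not chunk:
        break
    chunks.append(chunk)
print(chunks)
```

A single call never returns data from two streams, so a chunk may be shorter than the requested size. Because the streams are simply concatenated, each input should end with a newline; otherwise the last line of one file runs into the `---` of the next.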
Your minimal_adapter should take a fileinput.FileInput as a parameter and return an object which load_all can use. load_all takes as its argument either a string, which would require concatenating the input, or an object that has a read() method.
Since your minimal_adapter needs to preserve some state, I find it clearest/easiest to implement it as an instance of a class that has a __call__ method, and have that method return the instance and store its argument for future use. Implemented that way, the class should also have a read() method, as this will be called after handing the instance to load_all:
import fileinput
import ruamel.yaml


class MinimalAdapter:
    def __init__(self):
        self._fip = None
        self._buf = None  # storage of read but unused material, maximum one line

    def __call__(self, fip):
        self._fip = fip  # store for future use
        self._buf = ""
        return self

    def read(self, size):
        if len(self._buf) >= size:
            # enough in buffer from last read, just cut it off and return
            tmp, self._buf = self._buf[:size], self._buf[size:]
            return tmp
        for line in self._fip:
            self._buf += line
            if len(self._buf) > size:
                break
        else:
            # ran out of lines, return what we have
            tmp, self._buf = self._buf, ''
            return tmp
        tmp, self._buf = self._buf[:size], self._buf[size:]
        return tmp


minimal_adapter = MinimalAdapter()
# ruamel.yaml 0.18 and later no longer provide the module-level load_all();
# the load_all() method of a YAML() instance is used instead
yaml = ruamel.yaml.YAML(typ='safe', pure=True)
for doc in yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)
With this, running your example invocation gives exactly the output that you want.
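If the two-step construction (create the instance, then call it) feels indirect, the iterator can also be passed to the constructor. The following is a sketch under that assumption. LineReader is a hypothetical name, and the code uses only the standard library, so it accepts any iterable of strings, not just fileinput.input():

```python
class LineReader:
    """Minimal read()-only adapter around an iterable of strings."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self._buf = ''

    def read(self, size=-1):
        if size is None or size < 0:
            # Unbounded read: drain everything that is left.
            data, self._buf = self._buf + ''.join(self._lines), ''
            return data
        # Pull lines only until the buffer can satisfy the request.
        while len(self._buf) < size:
            line = next(self._lines, None)
            if line is None:
                break
            self._buf += line
        data, self._buf = self._buf[:size], self._buf[size:]
        return data


reader = LineReader(['--- prefix-doc\n', '--- hello\n'])
first = reader.read(10)
rest = reader.read()
print(repr(first), repr(rest))
```

An instance such as LineReader(fileinput.input()) would then be handed to load_all in place of minimal_adapter(fileinput.input()). At most one line beyond the requested size is held in the buffer at any time.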
This is probably only more memory efficient for larger files. load_all tries to read 1024-byte blocks at a time (easily found out by putting a print statement in MinimalAdapter.read()) and fileinput does some buffering as well (use strace if you're interested in finding out how it behaves).
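The requested block sizes can also be observed with a thin wrapper instead of editing the adapter. This is a sketch that assumes only that the wrapped object has a read(size) method. RecordingReader is a hypothetical helper, and the direct read calls below merely stand in for what a YAML loader would do:

```python
import io


class RecordingReader:
    """Wrap a file-like object and remember every requested read size."""

    def __init__(self, stream):
        self._stream = stream
        self.sizes = []

    def read(self, size=-1):
        self.sizes.append(size)
        return self._stream.read(size)


wrapped = RecordingReader(io.StringIO('--- hello\n'))
wrapped.read(1024)  # stand-in for the loader's first request
wrapped.read(1024)  # second request hits end of input
print(wrapped.sizes)
```

Wrapping the adapter itself, for example RecordingReader(minimal_adapter(fileinput.input())), and inspecting sizes after loading would show what the loader really asks for; the value may differ between libraries and versions.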
This was done using ruamel.yaml, a YAML 1.2 parser, of which I am the author. This should work for PyYAML as well, of which ruamel.yaml is a derived superset.