How to use yaml.load_all with fileinput.input?

Question

Without resorting to ''.join, is there a Pythonic way to use PyYAML's yaml.load_all with fileinput.input() to stream multiple documents from multiple sources?

I'm looking for something like the following (non-working example):

# example.py
import fileinput

import yaml

for doc in yaml.load_all(fileinput.input()):
    print(doc)

Expected output:

$ cat >pre.yaml <<<'--- prefix-doc'
$ cat >post.yaml <<<'--- postfix-doc'
$ python example.py pre.yaml - post.yaml <<<'--- hello'
prefix-doc
hello
postfix-doc

Of course, yaml.load_all expects a string, bytes, or a file-like object, and fileinput.input() is none of those, so the above example does not work.

Actual output:

$ python example.py pre.yaml - post.yaml <<<'--- hello'
...
AttributeError: FileInput instance has no attribute 'read'

You can make the example work with ''.join, but that's cheating. I'm looking for a way that does not read the entire stream into memory at once.

We might rephrase the question as: is there some way to emulate a string, bytes, or file-like object that proxies to an underlying iterator of strings? However, I doubt that yaml.load_all actually needs the entire file-like interface, so that phrasing would ask for more than is strictly necessary.

Ideally, I'm looking for the minimal adapter that would support something like this:

for doc in yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)

Answer 1

The problem with fileinput.input is that the resulting object doesn't have a read method, which is what yaml.load_all is looking for. If you're willing to give up fileinput, you can write your own class that does what you want:

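import sys
import yaml

class BunchOFiles(object):
    def __init__(self, *files):
        self.files = files
        self.fditer = self._fditer()
        self.fd = next(self.fditer, None)  # None if no files were given

    def _fditer(self):
        # Yield one open file object per name; '-' means standard input.
        for fn in self.files:
            if fn == '-':
                yield sys.stdin  # leave stdin open for the caller
            else:
                with open(fn, 'r') as fd:
                    yield fd

    def read(self, size=-1):
        # Return data from the current file, moving on to the next one at EOF.
        while self.fd is not None:
            data = self.fd.read(size)
            if data:
                return data
            self.fd = next(self.fditer, None)
        return ''  # every file is exhausted

bunch = BunchOFiles(*sys.argv[1:])
# Recent PyYAML releases require an explicit Loader for load_all();
# safe_load_all() supplies the safe one.
for doc in yaml.safe_load_all(bunch):
    print(doc)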

The BunchOFiles class gets you an object with a read method that will happily iterate over a list of files until everything is exhausted. Given the above code and your sample input, we get exactly the output you're looking for.
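The chaining behaviour described above can be checked without touching the file system. The following is only a sketch: ChainedReader is a hypothetical name, and it takes already-open file-like objects (here io.StringIO) instead of file names.

```python
import io


class ChainedReader:
    """Hypothetical helper: read() drains each stream in turn."""

    def __init__(self, *streams):
        self._streams = iter(streams)
        self._current = next(self._streams, None)

    def read(self, size=-1):
        # Serve data from the current stream; at EOF, move to the next one.
        while self._current is not None:
            data = self._current.read(size)
            if data:
                return data
            self._current = next(self._streams, None)
        return ""  # every stream is exhausted


reader = ChainedReader(io.StringIO("--- prefix-doc\n"), io.StringIO("--- hello\n"))
chunks = []
while True:
    chunk = reader.read(8)  # small blocks make the stream boundary visible
    if not chunk:
        break
    chunks.append(chunk)
print(chunks)
```

Note that a single read() never spans two streams: it returns a short block at the end of one stream and starts the next stream on the following call. A consumer that reads in blocks, as a YAML parser does, sees one continuous stream.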

Answer 2

Your minimal_adapter should take a fileinput.FileInput as a parameter and return an object that load_all can use. load_all accepts either a string, which would require concatenating the input, or an object with a read() method.

Since your minimal_adapter needs to preserve some state, I find it clearest and easiest to implement it as an instance of a class with a __call__ method that stores its argument for future use and returns the instance. Implemented that way, the class also needs a read() method, as this will be called after handing the instance to load_all:

import fileinput

from ruamel.yaml import YAML


class MinimalAdapter:
    def __init__(self):
        self._fip = None
        self._buf = None  # storage of read but unused material, maximum one line

    def __call__(self, fip):
        self._fip = fip  # store for future use
        self._buf = ""
        return self

    def read(self, size):
        if len(self._buf) >= size:
            # enough in buffer from last read, just cut it off and return
            tmp, self._buf = self._buf[:size], self._buf[size:]
            return tmp
        for line in self._fip:
            self._buf += line
            if len(self._buf) >= size:
                break
        # return up to size characters; if the input ran out, this is
        # whatever is left in the buffer (possibly the empty string)
        tmp, self._buf = self._buf[:size], self._buf[size:]
        return tmp


minimal_adapter = MinimalAdapter()

# Newer ruamel.yaml releases removed the module-level load_all();
# the YAML() instance API is used here instead.
yaml = YAML(typ='safe', pure=True)
for doc in yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)

With this, running your example invocation gives exactly the output that you want.

This is probably only more memory efficient for larger files. load_all tries to read 1024-byte blocks at a time (easily found out by putting a print statement in MinimalAdapter.read()), and fileinput does some buffering as well (use strace if you're interested in finding out how it behaves).
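As a rough sketch of the print-statement experiment, a small wrapper can record every size passed to read(). ReadLogger is a hypothetical name, and io.StringIO stands in for the real input here.

```python
import io


class ReadLogger:
    """Hypothetical wrapper that records the size of every read() request."""

    def __init__(self, stream):
        self._stream = stream
        self.sizes = []

    def read(self, size=-1):
        self.sizes.append(size)  # remember what the consumer asked for
        return self._stream.read(size)


logged = ReadLogger(io.StringIO("--- hello\n"))
first = logged.read(4)
rest = logged.read(1024)
print(logged.sizes)
```

Wrapping the object handed to load_all, for example ReadLogger(minimal_adapter(fileinput.input())), and printing its sizes attribute afterwards would show which block size the parser actually requests. That value may differ between library and Python versions.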

(This was done using ruamel.yaml, a YAML 1.2 parser, of which I am the author. This should work for PyYAML, of which ruamel.yaml is a derived superset, as well.)

Answer 1: score 4, 2016-09-07 02:41:55

Answer 2: score 3, accepted, 2016-09-07 05:06:08
