
python 3: reading bytes from stdin pipe with readahead

i want to read bytes. sys.stdin is opened in text mode, yet it has a buffer that can be used to read bytes: sys.stdin.buffer .

my problem is that when i pipe data into python i only seem to have 2 options if i want readahead, else i get an io.UnsupportedOperation: File or stream is not seekable.

  1. reading buffered text from sys.stdin , encoding that text to bytes, and seeking back

    ( sys.stdin.read(1).encode(); sys.stdin.seek(-1, io.SEEK_CUR) )

    unacceptable because the input stream may contain bytes that cannot be decoded as text.

  2. using peek to get some bytes from the stdin's buffer, slicing that to the appropriate number, and praying, as peek doesn't guarantee anything: it may give less or more than you request…

    ( sys.stdin.buffer.peek(1)[:1] )

    peek is really underdocumented and gives you a bunch of bytes that you have to performance-intensively slice.
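A minimal sketch of option 2, for illustration only. The helper name `peek_byte` is made up here, and an in-memory `io.BufferedReader` stands in for `sys.stdin.buffer` so the snippet runs without a pipe. It shows that `peek` does not move the cursor and that the slice is what limits the result to one byte.

```python
import io


def peek_byte(stream):
    """Return the next byte of a buffered binary stream without consuming it.

    Returns b"" at end of input. `peek_byte` is a hypothetical helper name.
    """
    # peek() may hand back more (or fewer) bytes than requested, so slice to 1.
    return stream.peek(1)[:1]


# Stand-in for sys.stdin.buffer; the real thing is also an io.BufferedReader.
demo = io.BufferedReader(io.BytesIO(b"hi\n"))

first = peek_byte(demo)   # look at the next byte
same = demo.read(1)       # cursor did not move, so read() yields that same byte
```

With a real pipe the call would be `peek_byte(sys.stdin.buffer)`.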

btw. that error really only applies when piping: for ./myscript.py <somefile , sys.stdin.buffer supports seeking. yet the sys.stdin is always the same hierarchy of objects:

$ cat testio.py
#!/usr/bin/env python3
from sys import stdin
print(stdin)
print(stdin.buffer)
print(stdin.buffer.raw)
$ ./testio.py
<_io.TextIOWrapper name='<stdin>' mode='r' encoding='UTF-8'>
<_io.BufferedReader name='<stdin>'>
<_io.FileIO name='<stdin>' mode='rb'>
$ ./testio.py <somefile
[the same as above]
$ echo hi | ./testio.py
[the same as above]

some initial ideas like wrapping the byte stream into a random access buffer fail with the same error as mentioned above: BufferedRandom(sys.stdin.buffer).seek(0) raises io.UnsupportedOperation…

finally, for your convenience i present:

Python's io class hierarchy

IOBase
├RawIOBase
│└FileIO
├BufferedIOBase  (buffers a RawIOBase)
│├BufferedWriter┐ 
│├BufferedReader│
││        └─────┴BufferedRWPair
│├BufferedRandom (implements seeking)
│└BytesIO        (wraps a bytes)
└TextIOBase
 ├TextIOWrapper  (wraps a BufferedIOBase)
 └StringIO       (wraps a str)

and in case you forgot the question: how do i get the next byte from stdin without de/encoding anything, and without advancing the stream's cursor?

The exception doesn't come from Python, but from the operating system, which doesn't allow seeking on pipes. (If you redirect input from a regular file, it can be seeked, even though it's standard input.) This is why you get the error in one case and not in the other, even though the classes are the same.
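A small sketch of that difference, assuming a POSIX-like system. The pipe and the temporary file are created in the script itself as stand-ins for `echo hi | ./testio.py` and `./testio.py <somefile`. Both readers are the same Python class, and only the underlying file descriptor decides whether seeking works.

```python
import io
import os
import tempfile

# A pipe: the kernel keeps no file position for it, so it cannot be seeked.
read_fd, write_fd = os.pipe()
pipe_reader = open(read_fd, "rb")
pipe_seekable = pipe_reader.seekable()

try:
    pipe_reader.seek(0)
    seek_error = None
except io.UnsupportedOperation as exc:
    seek_error = exc      # same error as in the question

# A regular file opened the same way, as with `<somefile` redirection.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "somefile")
    with open(path, "wb") as out:
        out.write(b"hi\n")
    with open(path, "rb") as file_reader:
        file_seekable = file_reader.seekable()
        same_class = type(file_reader) is type(pipe_reader)

pipe_reader.close()
os.close(write_fd)
```

Calling `sys.stdin.buffer.seekable()` at startup is one way to tell the two cases apart before attempting a seek.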

The classic Python 2 solution for readahead would be to wrap the stream in your own stream implementation that implements readahead:

import cStringIO
import os
import sys

class Peeker(object):
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.buf = cStringIO.StringIO()

    def _append_to_buf(self, contents):
        oldpos = self.buf.tell()
        self.buf.seek(0, os.SEEK_END)
        self.buf.write(contents)
        self.buf.seek(oldpos)

    def peek(self, size):
        contents = self.fileobj.read(size)
        self._append_to_buf(contents)
        return contents

    def read(self, size=None):
        if size is None:
            return self.buf.read() + self.fileobj.read()
        contents = self.buf.read(size)
        if len(contents) < size:
            contents += self.fileobj.read(size - len(contents))
        return contents

    def readline(self):
        line = self.buf.readline()
        if not line.endswith('\n'):
            line += self.fileobj.readline()
        return line

sys.stdin = Peeker(sys.stdin)

In Python 3, supporting the full sys.stdin while peeking the undecoded stream is complicated: one would wrap stdin.buffer as shown above, then instantiate a new TextIOWrapper over the peekable stream, and install that TextIOWrapper as sys.stdin .

However, since you only need to peek at sys.stdin.buffer , the above code will work just fine, after changing cStringIO.StringIO to io.BytesIO and '\n' to b'\n' .
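For reference, here is a sketch of the class with exactly those two substitutions applied. Everything else, including the behaviour of peek, is kept as in the Python 2 version, and the in-memory stream in the last line is only a stand-in for sys.stdin.buffer.

```python
import io
import os


class Peeker:
    """Python 3 port of the wrapper above, for binary streams."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.buf = io.BytesIO()            # was cStringIO.StringIO()

    def _append_to_buf(self, contents):
        # Append at the end, then restore the read position.
        oldpos = self.buf.tell()
        self.buf.seek(0, os.SEEK_END)
        self.buf.write(contents)
        self.buf.seek(oldpos)

    def peek(self, size):
        # Read ahead from the real stream and remember the bytes for later.
        contents = self.fileobj.read(size)
        self._append_to_buf(contents)
        return contents

    def read(self, size=None):
        if size is None:
            return self.buf.read() + self.fileobj.read()
        # Serve remembered bytes first, then fall through to the stream.
        contents = self.buf.read(size)
        if len(contents) < size:
            contents += self.fileobj.read(size - len(contents))
        return contents

    def readline(self):
        line = self.buf.readline()
        if not line.endswith(b"\n"):       # was '\n'
            line += self.fileobj.readline()
        return line


# Stand-in for sys.stdin.buffer so the sketch runs without a pipe.
stream = Peeker(io.BytesIO(b"hello\nworld\n"))
```

In a real script the last line would read `stream = Peeker(sys.stdin.buffer)`, after which `stream.peek(1)` returns the next byte and a following `stream.read(1)` returns the same byte.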

user4815162342's solution, while extremely useful, appears to have an issue in that it differs from the current behaviour of the io.BufferedReader peek method.

The builtin method will return the same data (starting from the current read position) for sequential peek() calls.

user4815162342's solution will return sequential chunks of data for each sequential peek call. This implies the user must wrap peek again to concatenate the output if they wish to use the same data more than once.

Here is the fix to return builtin behaviour:

def _buffered(self):
    oldpos = self.buf.tell()
    data = self.buf.read()
    self.buf.seek(oldpos)
    return data

def peek(self, size):
    buf = self._buffered()[:size]
    if len(buf) < size:
        contents = self.fileobj.read(size - len(buf))
        self._append_to_buf(contents)
        return self._buffered()
    return buf
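To see the two methods in context, here is a trimmed, self-contained Python 3 sketch. The class name `RepeatablePeeker` is invented for this example, only peek and read are included, and `io.BytesIO` is assumed as the internal buffer. Repeated peek calls now return the same leading bytes until something is read.

```python
import io
import os


class RepeatablePeeker:
    """Hypothetical wrapper whose peek() does not skip ahead on repeat calls."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.buf = io.BytesIO()

    def _append_to_buf(self, contents):
        oldpos = self.buf.tell()
        self.buf.seek(0, os.SEEK_END)
        self.buf.write(contents)
        self.buf.seek(oldpos)

    def _buffered(self):
        # Everything peeked but not yet read, without moving the position.
        oldpos = self.buf.tell()
        data = self.buf.read()
        self.buf.seek(oldpos)
        return data

    def peek(self, size):
        buf = self._buffered()[:size]
        if len(buf) < size:
            # Only fetch the bytes that are still missing.
            contents = self.fileobj.read(size - len(buf))
            self._append_to_buf(contents)
            return self._buffered()
        return buf

    def read(self, size=None):
        if size is None:
            return self.buf.read() + self.fileobj.read()
        contents = self.buf.read(size)
        if len(contents) < size:
            contents += self.fileobj.read(size - len(contents))
        return contents


demo = RepeatablePeeker(io.BytesIO(b"hello\nworld\n"))
first = demo.peek(2)
second = demo.peek(2)    # same bytes again, nothing was consumed
```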

See the full version here

There are other optimisations that could be applied, eg removal of previously buffered data upon a read call that exhausts the buffer. The current implementation leaves any peeked data in the buffer, but that data is inaccessible.
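One possible way to get that effect, offered as a sketch and not as part of either answer: keep the peeked bytes in a `bytearray` and delete them as they are read, so consumed data never lingers. The class name `DrainingPeeker` and the attribute `pending` are made up for this example, and peek here returns at most `size` bytes.

```python
import io


class DrainingPeeker:
    """Hypothetical variant that discards peeked bytes once they are read."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.pending = bytearray()   # peeked but not yet consumed

    def peek(self, size):
        missing = size - len(self.pending)
        if missing > 0:
            self.pending += self.fileobj.read(missing)
        return bytes(self.pending[:size])

    def read(self, size=None):
        if size is None:
            data = bytes(self.pending) + self.fileobj.read()
            self.pending.clear()
            return data
        data = bytes(self.pending[:size])
        del self.pending[:size]      # drop what was just handed out
        if len(data) < size:
            data += self.fileobj.read(size - len(data))
        return data


demo = DrainingPeeker(io.BytesIO(b"hello"))
peeked = demo.peek(2)
consumed = demo.read(1)
left_over = len(demo.pending)        # one peeked byte is still waiting
```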

Try this:

import sys

ssb = sys.stdin.buffer.read(1)
if ssb == b'h':
    print(ssb+sys.stdin.buffer.read())

Echo a string:

a@fuhq:~$ echo 'hi' | python3 buf_test.py 
b'hi\n'

Redirect a file:

a@fuhq:~$ cat hi.text
hi
a@fuhq:~$ python3 buf_test.py   <  hi.text
b'hi\n'
