Convert an iterable of bytes into an iterable of str, where each value is a line

Question

I have an iterable of bytes, for example

bytes_iter = (
    b'col_1,',
    b'c',
    b'ol_2\n1',
    b',"val',
    b'ue"\n',
)

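(although typically this would not be hard-coded or available all at once, but supplied by a generator), and I want to convert it into an iterable of str lines, where the line ending is not known in advance but can be any of \r, \n or \r\n. In this case the result would be: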

lines_iter = (
    'col_1,col_2',
    '1,"value"',
)

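(but again, just as an iterable, not everything in memory at once).

How can I do this?

Context: my aim is to pass the iterable of str lines to csv.reader (which I think needs whole lines?), but I'm interested in the answer in general.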
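On the assumption about csv.reader: it accepts any iterable that yields one str per line, not only file objects, so an iterable of str lines is sufficient. A minimal sketch, using the two expected lines from above as hard-coded stand-in input:

```python
import csv

# Stand-in for the desired lines_iter; any iterable of str lines works.
lines_iter = iter([
    'col_1,col_2',
    '1,"value"',
])

rows = list(csv.reader(lines_iter))
print(rows)
```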

Answer 1

Use the io module to do most of the work for you:

import io

class ReadableIterator(io.IOBase):
    def __init__(self, it):
        self.it = iter(it)
    def read(self, n):
        # ignore argument, nobody actually cares
        # note that it is *critical* that we suppress the `StopIteration` here
        return next(self.it, b'')
    def readable(self):
        return True

Then just call io.TextIOWrapper(ReadableIterator(some_iterable_of_bytes)).
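To get lines without their terminators, as in the question, the wrapper can be iterated and each line stripped. The following is only a sketch: the helper name iter_text_lines and the UTF-8 default are assumptions, not part of the answer. It relies on TextIOWrapper's default newline=None, which translates \r, \n and \r\n to '\n' on reading.

```python
import io


class ReadableIterator(io.IOBase):
    """Expose an iterable of bytes chunks through a read() method."""

    def __init__(self, it):
        self.it = iter(it)

    def read(self, n=-1):
        # The size argument is ignored; b'' signals end of stream.
        return next(self.it, b'')

    def readable(self):
        return True


def iter_text_lines(byte_chunks, encoding='utf-8'):
    # Hypothetical helper, not part of the answer above.
    # newline=None (the default) turns \r, \n and \r\n into '\n'.
    text = io.TextIOWrapper(ReadableIterator(byte_chunks), encoding=encoding)
    for line in text:
        yield line.rstrip('\n')


def bytes_gen():
    # Sample chunks; one line ends in \r\n and the other in \n.
    yield b'col_1,'
    yield b'c'
    yield b'ol_2\r\n1'
    yield b',"val'
    yield b'ue"\n'


lines = list(iter_text_lines(bytes_gen()))
print(lines)
```

Passing the encoding explicitly avoids depending on the locale's default encoding.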

Answer 2

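I used yield and re.split.

A yield expression is used when defining a generator function or an asynchronous generator function, so it can only be used in the body of a function definition. Using a yield expression in a function's body causes that function to be a generator function.

(?:\r\n)|(?:\r)|(?:\n) matches \r\n, \r or \n.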
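A short illustration of how this pattern behaves with re.split may help. It is a sketch using an equivalent pattern without the non-capturing groups, and the sample strings are made up. It shows that the order of the alternatives matters, and that a chunk ending in a line terminator produces a trailing empty string.

```python
import re

split_rule = re.compile(r"\r\n|\r|\n")

# All three terminators are treated as a single separator each.
mixed = split_rule.split("a\r\nb\rc\nd")

# If \r\n were listed last, \r would match first and \r\n would
# count as two separators, producing an empty piece in between.
wrong_order = re.split(r"\r|\n|\r\n", "a\r\nb")

# The last piece is either an incomplete line or, when the text
# ends with a terminator, an empty string.
partial = split_rule.split('ol_2\n1')
complete = split_rule.split('ue"\n')

print(mixed, wrong_order, partial, complete)
```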

import re
# \r\n is listed first so that it is matched as one separator.
split_rule = re.compile(r"\r\n|\r|\n")


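def converter(byte_data):
    # Text after the last line ending seen so far (an incomplete line).
    left_d = ""
    for d in byte_data:
        # Note: decoding per chunk assumes that no multi-byte character
        # is split across two chunks.
        t = split_rule.split(left_d + d.decode())
        left_d = ""
        last_index = len(t) - 1
        for index, i in enumerate(t):
            if not i:
                # Empty pieces are skipped, so blank lines are dropped.
                continue
            if index != last_index:
                yield i
            else:
                # The final piece may be incomplete; carry it over.
                left_d = i
    # Emit whatever remains once the input is exhausted.
    if left_d:
        yield left_d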


for i in converter(iter((
    b'col_1,',
    b'c',
    b'ol_2\n1',
    b',"val;',
    b'ue"\n',
))):
    print(i)

Output:

col_1,col_2
1,"val;ue"
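Two things to be aware of with the converter above: it decodes each chunk on its own, so a multi-byte character split across two chunks raises UnicodeDecodeError, and it skips empty pieces, so blank lines are dropped. One possible sketch that avoids both, assuming UTF-8 input, uses io.IncrementalNewlineDecoder from the standard library. The function name iter_lines is hypothetical.

```python
import codecs
import io


def iter_lines(chunks, encoding="utf-8"):
    # The inner decoder buffers incomplete byte sequences; the outer
    # one translates \r and \r\n to \n and holds back a trailing \r
    # until it knows whether a \n follows.
    decoder = io.IncrementalNewlineDecoder(
        codecs.getincrementaldecoder(encoding)(), True
    )
    pending = ""
    for chunk in chunks:
        pending += decoder.decode(chunk)
        # Everything before the last \n is complete; keep the rest.
        *lines, pending = pending.split("\n")
        yield from lines
    # Flush anything the decoders are still holding.
    pending += decoder.decode(b"", final=True)
    lines = pending.split("\n")
    if lines[-1] == "":
        lines.pop()  # no unterminated final line
    yield from lines


sample = [b'col_1,', b'c', b'ol_2\n1', b',"val', b'ue"\n']
result = list(iter_lines(sample))
print(result)
```

Unlike the converter above, this keeps blank lines as empty strings, which may or may not be what is wanted.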

Answer 3

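Maybe I'm missing something important (or subtle), since some of the upvoted answers seem more exotic than this, but I think you can decode and chain the bytes and use itertools.groupby to get a generator of strings: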

from itertools import groupby, chain

bytes_iter = (
    b'col_1,',
    b'c',
    b'ol_2\n',
    b'1,"val;',
    b'ue"\n'
)

def make_strings(G):
    strings = chain.from_iterable(map(bytes.decode, G))
    for k, g in groupby(strings, key=lambda c: c not in '\n\r'):
        if k:
            yield ''.join(g)

list(make_strings(bytes_iter))
# ['col_1,col_2', '1,"val;ue"']
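One caveat: bytes.decode is applied to each chunk separately, so a chunk that ends in the middle of a multi-byte character (for example b'\xe4\xb8') raises UnicodeDecodeError. A possible variation, sketched here under the assumption of UTF-8 input, swaps in codecs.iterdecode, which decodes incrementally. The name make_strings_incremental is hypothetical.

```python
import codecs
from itertools import chain, groupby


def make_strings_incremental(G, encoding='utf-8'):
    # codecs.iterdecode holds incomplete byte sequences until the
    # following chunk completes them.
    chars = chain.from_iterable(codecs.iterdecode(G, encoding))
    for is_text, group in groupby(chars, key=lambda c: c not in '\n\r'):
        if is_text:
            yield ''.join(group)


# The two characters are each split across chunk boundaries.
split_chunks = [b'Hello,\xe4\xb8', b'\x96', b'\xe7', b'\x95\x8c\n']
result = list(make_strings_incremental(split_chunks))
print(result)
```

As with the original, runs of consecutive line terminators are grouped together, so blank lines do not appear in the output.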

Answer 4

Putting together the contributions of @o11c and @user2357112 supports Monica:

import codecs
import csv
import io

def yield_bytes():
    chunks = [
        b'col_1,',
        b'c',
        b'ol_2\n1',
        b',"val',
        b'ue"\n',
        b'Hello,',
        b'\xe4\xb8',
        b'\x96',
        b'\xe7',
        b'\x95\x8c\n',
        b'\n',
    ]

    for chunk in chunks:
        yield chunk

decoder = codecs.getincrementaldecoder('utf-8')()

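def yield_encoded_bytes():
    # Decode incrementally so that multi-byte characters split across
    # chunks are reassembled, then re-encode each complete piece.
    for chunk in yield_bytes():
        s = decoder.decode(chunk, final=False)
        if s:
            yield s.encode('utf-8')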

class ReadableIterator(io.IOBase):
    def __init__(self, it):
        self.it = iter(it)
    def read(self, n):
        # ignore argument, nobody actually cares
        # note that it is *critical* that we suppress the `StopIteration` here
        return next(self.it, b'')
    def readable(self):
        return True

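# Pass the encoding explicitly (the default depends on the locale);
# newline='' is what the csv module recommends for its input.
f = io.TextIOWrapper(ReadableIterator(yield_encoded_bytes()), encoding='utf-8', newline='')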

for row in csv.reader(f):
    print(row)

I get:

['col_1', 'col_2']
['1', 'value']
['Hello', '世界']
[]

4 solutions

Solution 1
3 2022-01-09 08:27:12

Solution 2
2 2022-01-09 08:24:43

Solution 3
0 2022-01-09 08:43:39

Solution 4
0 2022-01-10 05:45:22
