How can I create an efficient iterator of chars from stdin with Rust?

Question

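Now that the Read::chars iterator has been officially deprecated, what is the proper way to obtain an iterator over the chars coming from a Reader like stdin without reading the entire stream into memory?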

Answer 1

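The corresponding issue for the deprecation nicely sums up the problems with Read::chars and offers suggestions:

Code that does not care about processing data incrementally can use Read::read_to_string instead. Code that does care presumably also wants to control its buffering strategy and work with &[u8] and &str slices that are as large as possible, rather than one char at a time. It should be based on the str::from_utf8 function as well as the valid_up_to and error_len methods of the Utf8Error type. One tricky aspect is dealing with cases where a single char is represented in UTF-8 by multiple bytes, and those bytes happen to be split across separate read calls / buffer chunks. (Utf8Error::error_len returning None indicates that this may be the case.) The utf-8 crate solves this, but in order to be flexible it provides an API that probably has too much surface to be included in the standard library.

Of course, the above is for data that is always UTF-8. If other character encodings need to be supported, consider using the encoding_rs or encoding crate.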
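The approach described in the quote can be sketched as follows. This is a minimal illustration, not the API of the utf-8 crate: the name for_each_str, the callback style, and the chunk_size parameter are assumptions made for the example. It reads fixed-size chunks, passes on the largest valid &str it can, and carries an incomplete trailing sequence over to the next read when error_len returns None.

```rust
use std::io::{self, ErrorKind, Read};
use std::str;

/// Hypothetical helper: reads `reader` in chunks of `chunk_size` bytes and
/// hands every valid piece of text to `on_str` as one `&str` slice.
fn for_each_str<R: Read>(
    mut reader: R,
    chunk_size: usize,
    mut on_str: impl FnMut(&str),
) -> io::Result<()> {
    // A char needs at most 4 bytes, so a carried-over prefix (at most 3 bytes)
    // always leaves room for at least one more byte.
    assert!(chunk_size >= 4);
    let mut buf = vec![0u8; chunk_size];
    let mut filled = 0; // number of not-yet-decoded bytes at the front of `buf`

    loop {
        let n = match reader.read(&mut buf[filled..]) {
            Ok(n) => n,
            Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        if n == 0 {
            // End of input: leftover bytes are a truncated sequence.
            return if filled == 0 {
                Ok(())
            } else {
                Err(io::Error::new(ErrorKind::InvalidData, "truncated UTF-8"))
            };
        }
        filled += n;

        let valid = match str::from_utf8(&buf[..filled]) {
            Ok(s) => {
                on_str(s);
                filled
            }
            Err(e) => {
                let valid = e.valid_up_to();
                if valid > 0 {
                    // The first `valid` bytes were just checked, so this cannot fail.
                    on_str(str::from_utf8(&buf[..valid]).unwrap());
                }
                if e.error_len().is_some() {
                    // Some(_) means definitely invalid bytes, not a split char.
                    return Err(io::Error::new(ErrorKind::InvalidData, "invalid UTF-8"));
                }
                valid
            }
        };
        // Keep the incomplete tail for the next read.
        buf.copy_within(valid..filled, 0);
        filled -= valid;
    }
}
```

Calling for_each_str(io::stdin().lock(), 8192, |s| ...) would process stdin incrementally; the chunk size is the knob that trades memory for the number of read calls.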

Your own iterator

The most efficient solution in terms of the number of I/O calls is to read everything into one giant String buffer and iterate over that:

use std::io::{self, Read};

fn main() {
    let stdin = io::stdin();
    let mut s = String::new();
    stdin.lock().read_to_string(&mut s).expect("Couldn't read");
    for c in s.chars() {
        println!(">{}<", c);
    }
}

You can combine this with the answer to Is there an owned version of String::chars?:

use std::io::{self, Read};

fn reader_chars<R: Read>(mut rdr: R) -> io::Result<impl Iterator<Item = char>> {
    let mut s = String::new();
    rdr.read_to_string(&mut s)?;
    // `into_chars` is not defined in this snippet; it is the owned iterator
    // from https://stackoverflow.com/q/47193584/155423
    Ok(s.into_chars())
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();

    for c in reader_chars(stdin.lock())? {
        println!(">{}<", c);
    }

    Ok(())
}

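We now have a function that returns an iterator of chars for any type that implements Read.

Once you have this pattern, it's just a matter of deciding where to make the trade-off between memory allocation and I/O requests. Here's a similar idea that uses line-sized buffers: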
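The snippet above relies on an owning char iterator that is defined in the linked answer, not here. As a rough idea of what such a helper might look like, here is a minimal sketch; the names OwnedChars and reader_chars_owned are made up for this example, and the linked answer may implement it differently.

```rust
use std::io::{self, Read};

/// Hypothetical owning iterator: keeps the `String` alive and walks it char by char.
struct OwnedChars {
    s: String,
    pos: usize, // byte offset of the next char; always on a char boundary
}

impl OwnedChars {
    fn new(s: String) -> Self {
        OwnedChars { s, pos: 0 }
    }
}

impl Iterator for OwnedChars {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // Decode the char starting at `pos`, then advance by its encoded length.
        let c = self.s[self.pos..].chars().next()?;
        self.pos += c.len_utf8();
        Some(c)
    }
}

/// Same shape as `reader_chars` above, but returning the sketch iterator.
fn reader_chars_owned<R: Read>(mut rdr: R) -> io::Result<OwnedChars> {
    let mut s = String::new();
    rdr.read_to_string(&mut s)?;
    Ok(OwnedChars::new(s))
}
```

Because the iterator owns its String, it can be returned from the function without borrowing from a local variable.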

use std::io::{BufRead, BufReader, Read};

fn reader_chars<R: Read>(rdr: R) -> impl Iterator<Item = char> {
    // We use 6 bytes here to force emoji to be segmented for demo purposes
    // Pick more appropriate size for your case
    let reader = BufReader::with_capacity(6, rdr);

    reader
        .lines()
        .flat_map(|l| l) // Ignoring any errors; lines() also strips the line terminators
        .flat_map(|s| s.into_chars())  // from https://stackoverflow.com/q/47193584/155423
}

fn main() {
    // emoji are 4 bytes each
    let data = "😻🧐🐪💩";
    let data = data.as_bytes();

    for c in reader_chars(data) {
        println!(">{}<", c);
    }
}
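The line-based version above skips read errors, and because lines() strips line terminators, newline characters never reach the caller. A minimal sketch of a variant that keeps both, using BufRead::read_line; the names LineChars and line_chars are hypothetical:

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical iterator that buffers one line at a time and yields its chars,
/// including the trailing '\n', surfacing I/O and UTF-8 errors to the caller.
struct LineChars<B> {
    reader: B,
    line: String,
    pos: usize, // byte offset of the next char within `line`
}

impl<B: BufRead> Iterator for LineChars<B> {
    type Item = io::Result<char>;

    fn next(&mut self) -> Option<io::Result<char>> {
        if self.pos >= self.line.len() {
            // Current line is exhausted: reuse the buffer for the next one.
            self.line.clear();
            self.pos = 0;
            match self.reader.read_line(&mut self.line) {
                Ok(0) => return None, // end of input
                Ok(_) => {}
                Err(e) => return Some(Err(e)),
            }
        }
        let c = self.line[self.pos..].chars().next()?;
        self.pos += c.len_utf8();
        Some(Ok(c))
    }
}

fn line_chars<R: Read>(rdr: R) -> LineChars<BufReader<R>> {
    LineChars {
        reader: BufReader::new(rdr),
        line: String::new(),
        pos: 0,
    }
}
```

Memory use is bounded by the longest line, so input without any newline still ends up fully buffered.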

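The other extreme is to perform one I/O request for every character. This doesn't take much memory, but it has a lot of I/O overhead.

The pragmatic answer

Copy and paste the implementation of Read::chars into your own code. It will work as well as it used to.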
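To make that overhead visible, one can count how many read calls actually reach the underlying reader. The sketch below is only an illustration: CountingReader and count_reads are hypothetical names, and an in-memory byte slice stands in for a real I/O source.

```rust
use std::io::{self, BufReader, Read};

/// Hypothetical wrapper that counts how often `read` is called on the inner reader.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Consumes `data` one byte at a time and returns the number of `read` calls
/// that reached the underlying reader.
fn count_reads(data: &[u8], buffered: bool) -> io::Result<usize> {
    let mut counter = CountingReader { inner: data, calls: 0 };
    if buffered {
        // The BufReader fetches large blocks and serves single bytes from memory.
        for b in BufReader::new(counter.by_ref()).bytes() {
            b?;
        }
    } else {
        // Every byte requested goes straight to the underlying reader.
        for b in counter.by_ref().bytes() {
            b?;
        }
    }
    Ok(counter.calls)
}
```

With a real file or pipe each of those calls may be a system call, which is why some buffer is usually placed between the source and a byte-at-a-time decoder.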


Answer 2

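As a couple of others have mentioned, it is possible to copy the deprecated implementation of Read::chars for use in your own code. Whether this is truly ideal depends on your use case. For me, it has proved good enough for now, although my application will likely outgrow this approach in the near future.

To illustrate how this can be done, let's look at a concrete example: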

use std::io::{self, Error, ErrorKind, Read};
use std::result;
use std::str;

struct MyReader<R> {
    inner: R,
}

impl<R: Read> MyReader<R> {
    fn new(inner: R) -> MyReader<R> {
        MyReader { inner }
    }
}

#[derive(Debug)]
enum MyReaderError {
    NotUtf8,
    Other(Error),
}

impl<R: Read> Iterator for MyReader<R> {
    type Item = result::Result<char, MyReaderError>;

    fn next(&mut self) -> Option<result::Result<char, MyReaderError>> {
        let first_byte = match read_one_byte(&mut self.inner)? {
            Ok(b) => b,
            Err(e) => return Some(Err(MyReaderError::Other(e))),
        };
        let width = utf8_char_width(first_byte);
        if width == 1 {
            return Some(Ok(first_byte as char));
        }
        if width == 0 {
            return Some(Err(MyReaderError::NotUtf8));
        }
        let mut buf = [first_byte, 0, 0, 0];
        {
            let mut start = 1;
            while start < width {
                match self.inner.read(&mut buf[start..width]) {
                    Ok(0) => return Some(Err(MyReaderError::NotUtf8)),
                    Ok(n) => start += n,
                    Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
                    Err(e) => return Some(Err(MyReaderError::Other(e))),
                }
            }
        }
        Some(match str::from_utf8(&buf[..width]).ok() {
            Some(s) => Ok(s.chars().next().unwrap()),
            None => Err(MyReaderError::NotUtf8),
        })
    }
}

The above code also requires read_one_byte and utf8_char_width to be implemented. Those should look something like:

static UTF8_CHAR_WIDTH: [u8; 256] = [
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x1F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x3F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x5F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x7F
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0x9F
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0xBF
0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // 0xDF
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3, // 0xEF
4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,0, // 0xFF
];

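// Returns the total length in bytes of a UTF-8 sequence that starts with `b`,
// or 0 if `b` cannot start a sequence.
fn utf8_char_width(b: u8) -> usize {
    UTF8_CHAR_WIDTH[b as usize] as usize
}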

fn read_one_byte(reader: &mut dyn Read) -> Option<io::Result<u8>> {
    let mut buf = [0];
    loop {
        return match reader.read(&mut buf) {
            Ok(0) => None,
            Ok(..) => Some(Ok(buf[0])),
            Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => Some(Err(e)),
        };
    }
}
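As a cross-check of the lookup table, the same widths can be expressed as a match over leading-byte ranges. This is only a sketch for readability; utf8_width_by_range is a hypothetical name and is not part of the copied implementation.

```rust
/// Width of a UTF-8 sequence given its first byte, or 0 if the byte cannot
/// start a sequence (continuation bytes, overlong leads 0xC0/0xC1, and 0xF5..).
fn utf8_width_by_range(b: u8) -> usize {
    match b {
        0x00..=0x7F => 1, // ASCII
        0xC2..=0xDF => 2, // two-byte lead
        0xE0..=0xEF => 3, // three-byte lead
        0xF0..=0xF4 => 4, // four-byte lead
        _ => 0,           // not a valid first byte
    }
}
```

The table form avoids the branches at the cost of 256 bytes of static data; both forms classify only the first byte, so the remaining bytes still need str::from_utf8 to be validated.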

Now we can use the MyReader implementation to produce an iterator of chars over some reader, such as io::Stdin:

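fn main() {
    let stdin = io::stdin();
    let reader = MyReader::new(stdin.lock());
    for c in reader {
        // Each item is a Result, which has no Display impl, so match on it.
        match c {
            Ok(c) => println!("{}", c),
            Err(e) => {
                eprintln!("error: {:?}", e);
                break;
            }
        }
    }
}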

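The limitations of this approach are discussed at length in the original issue thread. One particular concern worth pointing out, however, is that this iterator will not correctly handle streams that are not UTF-8 encoded.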

2 solutions

Solution 1: 10, 2018-05-17 16:33:03
Solution 2: 4, accepted, 2018-05-20 22:10:08
