How can I create an efficient iterator of chars from stdin with Rust?
The corresponding issue for deprecation nicely sums up the problems with Read::chars and offers suggestions:

Code that does not care about processing data incrementally can use Read::read_to_string instead. Code that does care presumably also wants to control its buffering strategy and work with &[u8] and &str slices that are as large as possible, rather than one char at a time. It should be based on the str::from_utf8 function as well as the valid_up_to and error_len methods of the Utf8Error type. One tricky aspect is dealing with cases where a single char is represented in UTF-8 by multiple bytes where those bytes happen to be split across separate read calls / buffer chunks. (Utf8Error::error_len returning None indicates that this may be the case.) The utf-8 crate solves this, but in order to be flexible provides an API that probably has too much surface to be included in the standard library.

Of course the above is for data that is always UTF-8. If other character encodings need to be supported, consider using the encoding_rs or encoding crate.
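The chunked approach described in the issue can be sketched roughly as follows. This is a minimal illustration rather than the implementation the issue has in mind: the function name decode_chunks, the callback parameter on_str, and the choice to replace invalid input with U+FFFD are all assumptions made for the example. It relies only on str::from_utf8, Utf8Error::valid_up_to, and Utf8Error::error_len from the standard library.

```rust
use std::io::{self, Read};
use std::str;

/// Hypothetical helper: reads `reader` in chunks and passes every run of
/// valid UTF-8 to `on_str`. Invalid sequences are reported as U+FFFD
/// (an assumption of this sketch; a real application may prefer an error).
fn decode_chunks<R: Read>(
    mut reader: R,
    chunk_size: usize,
    mut on_str: impl FnMut(&str),
) -> io::Result<()> {
    // At most 3 bytes of an incomplete char are carried over between reads,
    // so a buffer of at least 4 bytes always leaves room for new data.
    let mut buf = vec![0u8; chunk_size.max(4)];
    let mut pending = 0; // carried-over bytes at the front of `buf`

    loop {
        let n = match reader.read(&mut buf[pending..]) {
            Ok(n) => n,
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        if n == 0 {
            // End of input: leftover bytes form a truncated sequence.
            if pending > 0 {
                on_str("\u{FFFD}");
            }
            return Ok(());
        }

        let filled = pending + n;
        let mut start = 0;
        pending = 0;
        while start < filled {
            match str::from_utf8(&buf[start..filled]) {
                Ok(s) => {
                    on_str(s);
                    start = filled;
                }
                Err(e) => {
                    // Everything before `valid_end` is known to be valid.
                    let valid_end = start + e.valid_up_to();
                    if valid_end > start {
                        on_str(str::from_utf8(&buf[start..valid_end]).unwrap());
                    }
                    match e.error_len() {
                        // Definitely invalid bytes: skip them.
                        Some(len) => {
                            on_str("\u{FFFD}");
                            start = valid_end + len;
                        }
                        // Possibly a char split across chunks: keep the
                        // tail and complete it with the next read.
                        None => {
                            buf.copy_within(valid_end..filled, 0);
                            pending = filled - valid_end;
                            start = filled;
                        }
                    }
                }
            }
        }
    }
}

fn main() -> io::Result<()> {
    // A 6-byte buffer forces the 4-byte emoji to straddle chunk boundaries.
    let data = "😻🧐🐪💩".as_bytes();
    decode_chunks(data, 6, |s| {
        for c in s.chars() {
            println!(">{}<", c);
        }
    })
}
```

The callback receives &str slices that are as large as each read allows, so the caller decides whether to iterate over chars or process whole slices.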
The most efficient solution in terms of number of I/O calls is to read everything into one giant String buffer and iterate over that:
use std::io::{self, Read};

fn main() {
    let stdin = io::stdin();
    let mut s = String::new();
    stdin.lock().read_to_string(&mut s).expect("Couldn't read");
    for c in s.chars() {
        println!(">{}<", c);
    }
}
You can combine this with an answer from "Is there an owned version of String::chars?":
use std::io::{self, Read};

fn reader_chars<R: Read>(mut rdr: R) -> io::Result<impl Iterator<Item = char>> {
    let mut s = String::new();
    rdr.read_to_string(&mut s)?;
    // `into_chars` is the helper from https://stackoverflow.com/q/47193584/155423
    Ok(s.into_chars())
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    for c in reader_chars(stdin.lock())? {
        println!(">{}<", c);
    }
    Ok(())
}
We now have a function that returns an iterator of chars for any type that implements Read.
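The into_chars helper comes from the linked question and is not reproduced in this answer. As a rough stand-in, an owning char iterator could look like the sketch below. The names OwnedChars and owned_chars are hypothetical and chosen only for this example; the linked answers may implement the idea differently.

```rust
use std::io::{self, Read};

/// Hypothetical owning iterator: keeps the String alive and walks it
/// one char at a time by tracking a byte offset.
struct OwnedChars {
    s: String,
    pos: usize, // always on a char boundary
}

impl Iterator for OwnedChars {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // Decode the next char starting at `pos`, or stop at the end.
        let c = self.s[self.pos..].chars().next()?;
        self.pos += c.len_utf8();
        Some(c)
    }
}

/// Hypothetical constructor standing in for `into_chars`.
fn owned_chars(s: String) -> OwnedChars {
    OwnedChars { s, pos: 0 }
}

fn reader_chars<R: Read>(mut rdr: R) -> io::Result<OwnedChars> {
    let mut s = String::new();
    rdr.read_to_string(&mut s)?;
    Ok(owned_chars(s))
}

fn main() -> io::Result<()> {
    // In-memory input stands in for stdin here.
    for c in reader_chars("héllo 😻".as_bytes())? {
        println!(">{}<", c);
    }
    Ok(())
}
```

Because the iterator owns its String, it can be returned from the function without borrowing from a local variable.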
Once you have this pattern, it's just a matter of deciding where to make the tradeoff between memory allocation and I/O requests. Here's a similar idea that uses line-sized buffers:
use std::io::{BufRead, BufReader, Read};

fn reader_chars<R: Read>(rdr: R) -> impl Iterator<Item = char> {
    // We use 6 bytes here to force emoji to be segmented for demo purposes
    // Pick more appropriate size for your case
    let reader = BufReader::with_capacity(6, rdr);
    reader
        .lines()
        .flat_map(|l| l) // Ignoring any errors
        .flat_map(|s| s.into_chars()) // from https://stackoverflow.com/q/47193584/155423
}

fn main() {
    // emoji are 4 bytes each
    let data = "😻🧐🐪💩";
    let data = data.as_bytes();
    for c in reader_chars(data) {
        println!(">{}<", c);
    }
}
The far extreme would be to perform one I/O request for every character. This wouldn't take much memory, but would have a lot of I/O overhead.
Copy and paste the implementation of Read::chars into your own code. It will work as well as it used to.
As a couple of others have mentioned, it is possible to copy the deprecated implementation of Read::chars for use in your own code. Whether this is truly ideal will depend on your use case. For me, this proved to be good enough for now, although it is likely that my application will outgrow this approach in the near future.
To illustrate how this can be done, let's look at a concrete example:
use std::io::{self, Error, ErrorKind, Read};
use std::result;
use std::str;

struct MyReader<R> {
    inner: R,
}

impl<R: Read> MyReader<R> {
    fn new(inner: R) -> MyReader<R> {
        MyReader { inner }
    }
}

#[derive(Debug)]
enum MyReaderError {
    NotUtf8,
    Other(Error),
}

impl<R: Read> Iterator for MyReader<R> {
    type Item = result::Result<char, MyReaderError>;

    fn next(&mut self) -> Option<result::Result<char, MyReaderError>> {
        // `?` returns None here, ending the iteration at end of input.
        let first_byte = match read_one_byte(&mut self.inner)? {
            Ok(b) => b,
            Err(e) => return Some(Err(MyReaderError::Other(e))),
        };
        let width = utf8_char_width(first_byte);
        if width == 1 {
            return Some(Ok(first_byte as char));
        }
        if width == 0 {
            return Some(Err(MyReaderError::NotUtf8));
        }
        // Read the remaining continuation bytes of a multi-byte char.
        let mut buf = [first_byte, 0, 0, 0];
        {
            let mut start = 1;
            while start < width {
                match self.inner.read(&mut buf[start..width]) {
                    // Input ended in the middle of a char.
                    Ok(0) => return Some(Err(MyReaderError::NotUtf8)),
                    Ok(n) => start += n,
                    Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
                    Err(e) => return Some(Err(MyReaderError::Other(e))),
                }
            }
        }
        Some(match str::from_utf8(&buf[..width]).ok() {
            Some(s) => Ok(s.chars().next().unwrap()),
            None => Err(MyReaderError::NotUtf8),
        })
    }
}
The above code also requires read_one_byte and utf8_char_width to be implemented. Those should look something like:
// Width in bytes of a UTF-8 sequence, indexed by its first byte.
// 0 marks bytes that cannot start a sequence.
static UTF8_CHAR_WIDTH: [u8; 256] = [
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x1F
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x3F
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x5F
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x7F
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0x9F
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0xBF
    0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // 0xDF
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3, // 0xEF
    4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,0, // 0xFF
];

fn utf8_char_width(b: u8) -> usize {
    UTF8_CHAR_WIDTH[b as usize] as usize
}

// Reads a single byte, retrying on interruption. Returns None at end of input.
fn read_one_byte(reader: &mut dyn Read) -> Option<io::Result<u8>> {
    let mut buf = [0];
    loop {
        return match reader.read(&mut buf) {
            Ok(0) => None,
            Ok(..) => Some(Ok(buf[0])),
            Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => Some(Err(e)),
        };
    }
}
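For readers who find the 256-entry table hard to audit, the same mapping can be written as a match on the lead byte. This is only an equivalent sketch of the table above, not code from the standard library, and the name utf8_width_from_lead_byte is made up for the example.

```rust
/// Hypothetical table-free equivalent of `utf8_char_width`: returns the
/// length in bytes of the UTF-8 sequence that starts with `b`, or 0 if
/// `b` cannot start a sequence.
fn utf8_width_from_lead_byte(b: u8) -> usize {
    match b {
        0x00..=0x7F => 1, // ASCII
        0xC2..=0xDF => 2, // 0xC0 and 0xC1 would only encode overlong forms
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4, // 0xF5 and above would exceed U+10FFFF
        _ => 0,           // continuation bytes (0x80..=0xBF) and invalid leads
    }
}

fn main() {
    // Cross-check against the encoder in the standard library.
    for ch in "aé€😻".chars() {
        let mut buf = [0u8; 4];
        let encoded = ch.encode_utf8(&mut buf);
        let lead = encoded.as_bytes()[0];
        assert_eq!(utf8_width_from_lead_byte(lead), ch.len_utf8());
        println!("{:?}: lead byte {:#04X}, width {}", ch, lead, ch.len_utf8());
    }
}
```

The table trades a little memory for a single indexed load, while the match makes the valid lead-byte ranges explicit.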
Now we can use the MyReader implementation to produce an iterator of chars over some reader, like io::Stdin:
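fn main() {
    let stdin = io::stdin();
    let reader = MyReader::new(stdin.lock());
    for c in reader {
        // Each item is a Result, so handle decoding and I/O errors explicitly.
        match c {
            Ok(c) => println!("{}", c),
            Err(e) => {
                eprintln!("error: {:?}", e);
                break;
            }
        }
    }
}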
The limitations of this approach are discussed at length in the original issue thread. One particular concern worth pointing out, however, is that this iterator will not handle non-UTF-8 encoded streams correctly.