How can I store a Chars iterator in the same struct as the String it is iterating on?

I am just beginning to learn Rust and I'm struggling to handle lifetimes.

I'd like to have a struct with a String in it which will be used to buffer lines from stdin. Then I'd like to have a method on the struct which returns the next character from the buffer, or, if all of the characters from the line have been consumed, reads the next line from stdin.

The documentation says that Rust strings aren't indexable by character because that is inefficient with UTF-8. As I'm accessing the characters sequentially, it should be fine to use an iterator. However, as far as I understand, iterators in Rust are tied to the lifetime of the thing they're iterating over, and I can't work out how I could store this iterator in the struct alongside the String.

Here is the pseudo-Rust that I'd like to achieve. Obviously it doesn't compile.

struct CharGetter {
    /* Buffer containing one line of input at a time */
    input_buf: String,
    /* The position within input_buf of the next character to
     * return. This needs a lifetime parameter. */
    input_pos: std::str::Chars
}

impl CharGetter {
    fn next(&mut self) -> Result<char, io::Error> {
        loop {
            match self.input_pos.next() {
                /* If there is still a character left in the input
                 * buffer then we can just return it immediately. */
                Some(n) => return Ok(n),
                /* Otherwise get the next line */
                None => {
                    io::stdin().read_line(&mut self.input_buf)?;
                    /* Reset the iterator to the beginning of the
                     * line. Obviously this doesn’t work because it’s
                     * not obeying the lifetime of input_buf */
                    self.input_pos = self.input_buf.chars();
                }
            }
        }
    }
}

I am trying to do the Synacor challenge. This involves implementing a virtual machine where one of the opcodes reads a character from stdin and stores it in a register. I have this part working fine. The documentation states that whenever the program inside the VM reads a character, it will keep reading until it has read a whole line. I wanted to take advantage of this to add a "save" command to my implementation. That means that whenever the program asks for a character, I will read a line from the input. If the line is "save", I will save the state of the VM and then continue to get another line to feed to the VM. Each time the VM executes the input opcode, I need to be able to give it one character at a time from the buffered line until the buffer is depleted.

My current implementation is here. My plan was to add input_buf and input_pos to the Machine struct which represents the state of the VM.

As thoroughly described in "Why can't I store a value and a reference to that value in the same struct?", in general you can't do this because it truly is unsafe. When you move memory, you invalidate references. This is why a lot of people use Rust: to not have invalid references which lead to program crashes!

Let's look at your code:

io::stdin().read_line(&mut self.input_buf)?;
self.input_pos = self.input_buf.chars();

Between these two lines, you've left self.input_pos in a bad state. If a panic occurs, then the destructor of the object has the opportunity to access invalid memory! Rust is protecting you from an issue that most people never think about.

As also described in that answer:

There is a special case where the lifetime tracking is overzealous: when you have something placed on the heap. This occurs when you use a Box<T>, for example. In this case, the structure that is moved contains a pointer into the heap. The pointed-at value will remain stable, but the address of the pointer itself will move. In practice, this doesn't matter, as you always follow the pointer.

Some crates provide ways of representing this case, but they require that the base address never move. This rules out mutating vectors, which may cause a reallocation and a move of the heap-allocated values.

Remember that a String is just a vector of bytes with extra preconditions added.
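
To make the distinction between "moving the String" and "moving its heap data" concrete, here is a minimal sketch. The two function names are hypothetical and exist only for illustration. The first function moves a String and compares the heap pointer before and after; the second grows a String, which is the kind of mutation that may reallocate.

```rust
/// Returns the heap pointer of a String before and after the String
/// value itself is moved to another binding.
fn heap_pointer_across_move() -> (*const u8, *const u8) {
    let s = String::from("hello");
    let before = s.as_ptr();
    // A move copies the (pointer, length, capacity) triple; the heap
    // bytes themselves stay where they are.
    let moved = s;
    (before, moved.as_ptr())
}

/// Returns the heap pointer of a String before and after it grows.
/// Growing past the current capacity may reallocate, so the two
/// pointers are allowed to differ; nothing guarantees either outcome.
fn heap_pointer_after_growth() -> (*const u8, *const u8) {
    let mut s = String::from("hello");
    let before = s.as_ptr();
    s.push_str(&"x".repeat(1024));
    (before, s.as_ptr())
}
```

The first pair is always equal, which is why a pointer into the heap data survives a move of the owning struct. The second pair carries no such guarantee, which is why the buffer must not be mutated while anything points into it.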

Instead of using one of those crates, we can also roll our own solution, which means we (read: you) get to accept all the responsibility for ensuring that we aren't doing anything wrong.

The trick here is to ensure that the data inside the String never moves and no accidental references are taken.

use std::{mem, str::Chars};

/// I believe this struct to be safe because the String is
/// heap-allocated (stable address) and will never be modified
/// (stable address). `chars` will not outlive the struct, so
/// lying about the lifetime should be fine.
///
/// TODO: What about during destruction?
///       `Chars` shouldn't have a destructor...
struct OwningChars {
    _s: String,
    chars: Chars<'static>,
}

impl OwningChars {
    fn new(s: String) -> Self {
        let chars = unsafe { mem::transmute(s.chars()) };
        OwningChars { _s: s, chars }
    }
}

impl Iterator for OwningChars {
    type Item = char;
    fn next(&mut self) -> Option<Self::Item> {
        self.chars.next()
    }
}

You might even think about putting just this code into a module so that you can't accidentally muck about with the innards.
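
As a rough sketch of that suggestion (the module name `owning_chars` is an assumption, not part of the original code), the struct can be wrapped in a module with private fields, so code outside the module can neither touch the String nor see the falsely `'static` iterator. The same caveats about the unsafe block apply; this only narrows the amount of code that has to be audited.

```rust
mod owning_chars {
    use std::{mem, str::Chars};

    pub struct OwningChars {
        // Never read or mutated after construction; it only keeps the
        // heap buffer alive for as long as `chars` exists.
        _s: String,
        // Really borrows from `_s`. The 'static lifetime is a lie that
        // must not leak out of this module.
        chars: Chars<'static>,
    }

    impl OwningChars {
        pub fn new(s: String) -> Self {
            // Relies on the String's heap buffer not moving when `s`
            // itself is moved into the struct below.
            let chars =
                unsafe { mem::transmute::<Chars<'_>, Chars<'static>>(s.chars()) };
            OwningChars { _s: s, chars }
        }
    }

    impl Iterator for OwningChars {
        type Item = char;

        fn next(&mut self) -> Option<char> {
            // Hands out `char` values only, never a reference tied to
            // the fake 'static lifetime.
            self.chars.next()
        }
    }
}
```

Because both fields are private, the only operations available to callers are `owning_chars::OwningChars::new` and the `Iterator` methods.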


Here's the same code using the ouroboros crate to create a self-referential struct containing the String and a Chars iterator:

use ouroboros::self_referencing; // 0.4.1
use std::str::Chars;

#[self_referencing]
pub struct IntoChars {
    string: String,
    #[borrows(string)]
    chars: Chars<'this>,
}

// All these implementations are based on what `Chars` implements itself

impl Iterator for IntoChars {
    type Item = char;

    #[inline]
    fn next(&mut self) -> Option<Self::Item> {
        self.with_mut(|me| me.chars.next())
    }

    #[inline]
    fn count(mut self) -> usize {
        self.with_mut(|me| me.chars.count())
    }

    #[inline]
    fn size_hint(&self) -> (usize, Option<usize>) {
        self.with(|me| me.chars.size_hint())
    }

    #[inline]
    fn last(mut self) -> Option<Self::Item> {
        self.with_mut(|me| me.chars.last())
    }
}

impl DoubleEndedIterator for IntoChars {
    #[inline]
    fn next_back(&mut self) -> Option<Self::Item> {
        self.with_mut(|me| me.chars.next_back())
    }
}

impl std::iter::FusedIterator for IntoChars {}

// And an extension trait for convenience

trait IntoCharsExt {
    fn into_chars(self) -> IntoChars;
}

impl IntoCharsExt for String {
    fn into_chars(self) -> IntoChars {
        IntoCharsBuilder {
            string: self,
            chars_builder: |s| s.chars(),
        }
        .build()
    }
}

Here's the same code using the rental crate to create a self-referential struct containing the String and a Chars iterator:

#[macro_use]
extern crate rental; // 0.5.5

rental! {
    mod into_chars {
        pub use std::str::Chars;

        #[rental]
        pub struct IntoChars {
            string: String,
            chars: Chars<'string>,
        }
    }
}

use into_chars::IntoChars;

// All these implementations are based on what `Chars` implements itself

impl Iterator for IntoChars {
    type Item = char;

    #[inline]
    fn next(&mut self) -> Option<Self::Item> {
        self.rent_mut(|chars| chars.next())
    }

    #[inline]
    fn count(mut self) -> usize {
        self.rent_mut(|chars| chars.count())
    }

    #[inline]
    fn size_hint(&self) -> (usize, Option<usize>) {
        self.rent(|chars| chars.size_hint())
    }

    #[inline]
    fn last(mut self) -> Option<Self::Item> {
        self.rent_mut(|chars| chars.last())
    }
}

impl DoubleEndedIterator for IntoChars {
    #[inline]
    fn next_back(&mut self) -> Option<Self::Item> {
        self.rent_mut(|chars| chars.next_back())
    }
}

impl std::iter::FusedIterator for IntoChars {}

// And an extension trait for convenience

trait IntoCharsExt {
    fn into_chars(self) -> IntoChars;
}

impl IntoCharsExt for String {
    fn into_chars(self) -> IntoChars {
        IntoChars::new(self, |s| s.chars())
    }
}

This answer doesn't address the general problem of trying to store an iterator in the same struct as the object that it is iterating over. However, in this particular case we can get around the problem by storing an integer byte index into the string instead of the iterator. Rust will let you create a string slice starting at this byte index, and we can use that slice to extract the next character from that point. Then we just need to advance the byte index by the number of bytes the code point takes up in UTF-8. We can get that number with char::len_utf8().

This would work like the below:

struct CharGetter {
    // Buffer containing one line of input at a time
    input_buf: String,
    // The byte position within input_buf of the next character to
    // return.
    input_pos: usize,
}

impl CharGetter {
    fn next(&mut self) -> Result<char, std::io::Error> {
        loop {
            // Get an iterator over the string slice starting at the
            // next byte position in the string
            let mut input_pos = self.input_buf[self.input_pos..].chars();

            // Try to get a character from the temporary iterator
            match input_pos.next() {
                // If there is still a character left in the input
                // buffer then we can just return it immediately.
                Some(n) => {
                    // Move the position along by the number of bytes
                    // that this character occupies in UTF-8
                    self.input_pos += n.len_utf8();
                    return Ok(n);
                },
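                // Otherwise get the next line
                None => {
                    self.input_buf.clear();
                    // Reset the byte position to the beginning of the
                    // new line.
                    self.input_pos = 0;
                    // read_line returns Ok(0) at end of input; report it
                    // as an error instead of looping forever.
                    if std::io::stdin().read_line(&mut self.input_buf)? == 0 {
                        return Err(std::io::ErrorKind::UnexpectedEof.into());
                    }
                }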
            }
        }
    }
}
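
For experimenting with this approach without typing into a terminal, the same idea can be sketched over any `BufRead` source. The names `LineChars` and `next_char` are hypothetical, and returning `Ok(None)` at end of input is a design choice of this sketch rather than something the question asked for.

```rust
use std::io::{self, BufRead};

/// Hypothetical variant of `CharGetter` that reads from any `BufRead`
/// instead of being tied to stdin.
struct LineChars<R: BufRead> {
    reader: R,
    // Buffer containing one line of input at a time
    buf: String,
    // Byte position within `buf` of the next character to return
    pos: usize,
}

impl<R: BufRead> LineChars<R> {
    fn new(reader: R) -> Self {
        LineChars { reader, buf: String::new(), pos: 0 }
    }

    /// Returns the next character, refilling the buffer one line at a
    /// time. Returns `Ok(None)` once the reader is exhausted.
    fn next_char(&mut self) -> io::Result<Option<char>> {
        loop {
            // A temporary iterator over the unread tail of the buffer
            if let Some(c) = self.buf[self.pos..].chars().next() {
                // Advance by the encoded width of this character
                self.pos += c.len_utf8();
                return Ok(Some(c));
            }
            // Buffer is used up: reset the position first so that it is
            // valid even if the read below fails.
            self.buf.clear();
            self.pos = 0;
            if self.reader.read_line(&mut self.buf)? == 0 {
                return Ok(None);
            }
        }
    }
}
```

A byte slice implements `BufRead`, so `LineChars::new("ab\n".as_bytes())` is enough to step through the characters in a test, and `LineChars::new(io::stdin().lock())` would read from standard input.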

In practice this isn't really doing anything that is more safe than storing the iterator, because the input_pos variable is still effectively doing the same thing as an iterator and its validity still depends on input_buf not being modified. Presumably if something else modified the buffer in the meantime, then the program could panic when the string slice is created because the index might no longer be at a character boundary.
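
If that panic is a concern, one option is to check the index instead of slicing directly. The sketch below uses `str::get`, which returns `None` where the indexing syntax would panic; the helper name `tail_from` is made up for illustration.

```rust
/// Returns the part of `s` starting at `byte_pos`, or `None` if
/// `byte_pos` is out of range or falls inside a multi-byte character.
/// Unlike `&s[byte_pos..]`, this never panics.
fn tail_from(s: &str, byte_pos: usize) -> Option<&str> {
    s.get(byte_pos..)
}

/// True if `byte_pos` can be used as a slice start in `s`.
fn is_valid_position(s: &str, byte_pos: usize) -> bool {
    s.is_char_boundary(byte_pos)
}
```

For a string such as "é!", where the first character occupies two bytes, position 1 is not a character boundary, so `tail_from` yields `None` there while positions 0 and 2 are fine.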
