
How can I store a Chars iterator in the same struct as the String it is iterating on?

I am just beginning to learn Rust and I'm struggling to handle the lifetimes.

I'd like to have a struct with a String in it which will be used to buffer lines from stdin. Then I'd like to have a method on the struct which returns the next character from the buffer, or if all of the characters from the line have been consumed it will read the next line from stdin.

The documentation says that Rust strings aren't indexable by character because that is inefficient with UTF-8. As I'm accessing the characters sequentially it should be fine to use an iterator. However, as far as I understand, iterators in Rust are tied to the lifetime of the thing they're iterating and I can't work out how I could store this iterator in the struct alongside the String.

Here is the pseudo-Rust that I'd like to achieve. Obviously it doesn't compile.

struct CharGetter {
    /* Buffer containing one line of input at a time */
    input_buf: String,
    /* The position within input_buf of the next character to
     * return. This needs a lifetime parameter. */
    input_pos: std::str::Chars
}

impl CharGetter {
    fn next(&mut self) -> Result<char, io::Error> {
        loop {
            match self.input_pos.next() {
                /* If there is still a character left in the input
                 * buffer then we can just return it immediately. */
                Some(n) => return Ok(n),
                /* Otherwise get the next line */
                None => {
                    io::stdin().read_line(&mut self.input_buf)?;
                    /* Reset the iterator to the beginning of the
                     * line. Obviously this doesn’t work because it’s
                     * not obeying the lifetime of input_buf */
                    self.input_pos = self.input_buf.chars();
                }
            }
        }
    }
}

I am trying to do the Synacor challenge. This involves implementing a virtual machine where one of the opcodes reads a character from stdin and stores it in a register. I have this part working fine. The documentation states that whenever the program inside the VM reads a character it will keep reading until it reads a whole line. I wanted to take advantage of this to add a "save" command to my implementation. That means that whenever the program asks for a character, I will read a line from the input. If the line is "save", I will save the state of the VM and then continue to get another line to feed to the VM. Each time the VM executes the input opcode, I need to be able to give it one character at a time from the buffered line until the buffer is depleted.

My current implementation is here. My plan was to add input_buf and input_pos to the Machine struct which represents the state of the VM.

As thoroughly described in "Why can't I store a value and a reference to that value in the same struct?", in general you can't do this because it truly is unsafe. When you move memory, you invalidate references. This is why a lot of people use Rust: to not have invalid references which lead to program crashes!

Let's look at your code:

io::stdin().read_line(&mut self.input_buf)?;
self.input_pos = self.input_buf.chars();

Between these two lines, you've left self.input_pos in a bad state: read_line may reallocate input_buf, so the old iterator can point into freed memory. If a panic occurs, then the destructor of the object has the opportunity to access invalid memory! Rust is protecting you from an issue that most people never think about.


As also described in that answer:

There is a special case where the lifetime tracking is overzealous: when you have something placed on the heap. This occurs when you use a Box<T>, for example. In this case, the structure that is moved contains a pointer into the heap. The pointed-at value will remain stable, but the address of the pointer itself will move. In practice, this doesn't matter, as you always follow the pointer.

Some crates provide ways of representing this case, but they require that the base address never move. This rules out mutating vectors, which may cause a reallocation and a move of the heap-allocated values.

Remember that a String is just a vector of bytes with extra preconditions added.
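
To make the quoted point concrete, here is a minimal sketch showing that moving a String relocates only its (pointer, length, capacity) triple, while the heap bytes stay where they are. The function name heap_buffer_survives_move is made up for this illustration.

```rust
/// Hypothetical helper: returns true if the heap buffer of `s` stays at
/// the same address after the `String` value itself has been moved.
fn heap_buffer_survives_move(s: String) -> bool {
    // Address of the first byte of the heap allocation.
    let before = s.as_ptr();
    // Moving `s` into a Box relocates the (pointer, length, capacity)
    // triple, but not the bytes that the pointer refers to.
    let moved: Box<String> = Box::new(s);
    before == moved.as_ptr()
}
```

The same check would not be reliable after growing the string (for example with push_str beyond its capacity), because that may reallocate the buffer. This is why the approach below never mutates the String.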

Instead of using one of those crates, we can also roll our own solution, which means we (read: you) get to accept all the responsibility for ensuring that we aren't doing anything wrong.

The trick here is to ensure that the data inside the String never moves and no accidental references are taken.

use std::{mem, str::Chars};

/// I believe this struct to be safe because the String is
/// heap-allocated (stable address) and will never be modified
/// (stable address). `chars` will not outlive the struct, so
/// lying about the lifetime should be fine.
///
/// TODO: What about during destruction?
///       `Chars` shouldn't have a destructor...
struct OwningChars {
    _s: String,
    chars: Chars<'static>,
}

impl OwningChars {
    fn new(s: String) -> Self {
        let chars = unsafe { mem::transmute(s.chars()) };
        OwningChars { _s: s, chars }
    }
}

impl Iterator for OwningChars {
    type Item = char;
    fn next(&mut self) -> Option<Self::Item> {
        self.chars.next()
    }
}

You might even think about putting just this code into a module so that you can't accidentally muck about with the innards.
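
A minimal sketch of that module idea, assuming a module named owning_chars (the name is hypothetical). The fields are private, so code outside the module can only construct the value and iterate it. It relies on the same unverified assumption as above: the String is never mutated after construction.

```rust
mod owning_chars {
    use std::{mem, str::Chars};

    pub struct OwningChars {
        // Declared before the String so it is dropped first. `Chars`
        // has no destructor, so this ordering is only defensive.
        chars: Chars<'static>,
        // Private and never touched again, so the buffer cannot be
        // reallocated while `chars` points into it.
        _s: String,
    }

    impl OwningChars {
        pub fn new(s: String) -> Self {
            // Assumption (not checked by the compiler): the heap buffer
            // of `s` lives as long as `self` and is never modified, so
            // extending the lifetime to 'static is never observable.
            let chars = unsafe {
                mem::transmute::<Chars<'_>, Chars<'static>>(s.chars())
            };
            OwningChars { chars, _s: s }
        }
    }

    impl Iterator for OwningChars {
        type Item = char;

        fn next(&mut self) -> Option<char> {
            self.chars.next()
        }
    }
}
```

Callers would write owning_chars::OwningChars::new(line) and then use it like any other iterator. Nothing outside the module can reach chars or _s directly.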


Here's the same code using the ouroboros crate to create a self-referential struct containing the String and a Chars iterator:

use ouroboros::self_referencing; // 0.4.1
use std::str::Chars;

#[self_referencing]
pub struct IntoChars {
    string: String,
    #[borrows(string)]
    chars: Chars<'this>,
}

// All these implementations are based on what `Chars` implements itself

impl Iterator for IntoChars {
    type Item = char;

    #[inline]
    fn next(&mut self) -> Option<Self::Item> {
        self.with_mut(|me| me.chars.next())
    }

    #[inline]
    fn count(mut self) -> usize {
        self.with_mut(|me| me.chars.count())
    }

    #[inline]
    fn size_hint(&self) -> (usize, Option<usize>) {
        self.with(|me| me.chars.size_hint())
    }

    #[inline]
    fn last(mut self) -> Option<Self::Item> {
        self.with_mut(|me| me.chars.last())
    }
}

impl DoubleEndedIterator for IntoChars {
    #[inline]
    fn next_back(&mut self) -> Option<Self::Item> {
        self.with_mut(|me| me.chars.next_back())
    }
}

impl std::iter::FusedIterator for IntoChars {}

// And an extension trait for convenience

trait IntoCharsExt {
    fn into_chars(self) -> IntoChars;
}

impl IntoCharsExt for String {
    fn into_chars(self) -> IntoChars {
        IntoCharsBuilder {
            string: self,
            chars_builder: |s| s.chars(),
        }
        .build()
    }
}

Here's the same code using the rental crate (which its author has since declared unmaintained) to create a self-referential struct containing the String and a Chars iterator:

#[macro_use]
extern crate rental; // 0.5.5

rental! {
    mod into_chars {
        pub use std::str::Chars;

        #[rental]
        pub struct IntoChars {
            string: String,
            chars: Chars<'string>,
        }
    }
}

use into_chars::IntoChars;

// All these implementations are based on what `Chars` implements itself

impl Iterator for IntoChars {
    type Item = char;

    #[inline]
    fn next(&mut self) -> Option<Self::Item> {
        self.rent_mut(|chars| chars.next())
    }

    #[inline]
    fn count(mut self) -> usize {
        self.rent_mut(|chars| chars.count())
    }

    #[inline]
    fn size_hint(&self) -> (usize, Option<usize>) {
        self.rent(|chars| chars.size_hint())
    }

    #[inline]
    fn last(mut self) -> Option<Self::Item> {
        self.rent_mut(|chars| chars.last())
    }
}

impl DoubleEndedIterator for IntoChars {
    #[inline]
    fn next_back(&mut self) -> Option<Self::Item> {
        self.rent_mut(|chars| chars.next_back())
    }
}

impl std::iter::FusedIterator for IntoChars {}

// And an extension trait for convenience

trait IntoCharsExt {
    fn into_chars(self) -> IntoChars;
}

impl IntoCharsExt for String {
    fn into_chars(self) -> IntoChars {
        IntoChars::new(self, |s| s.chars())
    }
}

This answer doesn't address the general problem of trying to store an iterator in the same struct as the object that it is iterating over. However, in this particular case we can get around the problem by storing an integer byte index into the string instead of the iterator. Rust will let you create a string slice starting at this byte index, and we can use that slice to extract the next character from that point. Then we just need to advance the byte index by the number of bytes the code point takes up in UTF-8, which char::len_utf8() gives us.

This would look like the following:

struct CharGetter {
    // Buffer containing one line of input at a time
    input_buf: String,
    // The byte position within input_buf of the next character to
    // return.
    input_pos: usize,
}

impl CharGetter {
    fn next(&mut self) -> Result<char, std::io::Error> {
        loop {
            // Get an iterator over the string slice starting at the
            // next byte position in the string
            let mut input_pos = self.input_buf[self.input_pos..].chars();

            // Try to get a character from the temporary iterator
            match input_pos.next() {
                // If there is still a character left in the input
                // buffer then we can just return it immediately.
                Some(n) => {
                    // Move the position along by the number of bytes
                    // that this character occupies in UTF-8
                    self.input_pos += n.len_utf8();
                    return Ok(n);
                },
                // Otherwise get the next line
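                None => {
                    self.input_buf.clear();
                    // read_line returns Ok(0) at end of input; report
                    // that as an error instead of looping forever.
                    if std::io::stdin().read_line(&mut self.input_buf)? == 0 {
                        return Err(std::io::ErrorKind::UnexpectedEof.into());
                    }
                    // Reset the position to the beginning of the
                    // line.
                    self.input_pos = 0;
                }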
            }
        }
    }
}
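
For testing without a real stdin, the same byte-index idea can be sketched over any BufRead source. LineCharGetter and next_char are hypothetical names, and returning Ok(None) at end of input is a design choice of this sketch rather than part of the code above.

```rust
use std::io::{self, BufRead};

/// Hypothetical variant of `CharGetter` that is generic over its input.
struct LineCharGetter<R: BufRead> {
    reader: R,
    // Buffer containing one line of input at a time.
    input_buf: String,
    // Byte position within `input_buf` of the next character.
    input_pos: usize,
}

impl<R: BufRead> LineCharGetter<R> {
    fn new(reader: R) -> Self {
        LineCharGetter { reader, input_buf: String::new(), input_pos: 0 }
    }

    /// Returns the next character, or `Ok(None)` at end of input.
    fn next_char(&mut self) -> io::Result<Option<char>> {
        loop {
            // Temporary iterator over the unread tail of the buffer.
            if let Some(c) = self.input_buf[self.input_pos..].chars().next() {
                self.input_pos += c.len_utf8();
                return Ok(Some(c));
            }
            // Buffer exhausted: reset it and read the next line.
            self.input_buf.clear();
            self.input_pos = 0;
            if self.reader.read_line(&mut self.input_buf)? == 0 {
                return Ok(None);
            }
        }
    }
}
```

With this shape, LineCharGetter::new(io::stdin().lock()) would read from stdin, while a byte slice such as "save\n".as_bytes() can stand in for input in tests.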

In practice this isn't really any safer than storing the iterator, because the input_pos variable is still effectively doing the same job as an iterator and its validity still depends on input_buf not being modified. If something else modified the buffer in the meantime, the program could panic when the string slice is created, because input_pos might be past the end of the buffer or no longer on a character boundary.
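
If that panic is a concern, one option is to slice with str::get, which returns None instead of panicking when the index is out of range or not on a character boundary. A minimal sketch (char_at is a hypothetical helper, not part of the code above):

```rust
/// Hypothetical helper: returns the character starting at byte offset
/// `pos`, or `None` if `pos` is past the end of `buf`, is not on a
/// character boundary, or no character is left.
fn char_at(buf: &str, pos: usize) -> Option<char> {
    // `get` performs the same checks as `buf[pos..]` but reports
    // failure as `None` rather than panicking.
    buf.get(pos..)?.chars().next()
}
```

For example, with the buffer "éa" the character é occupies bytes 0 and 1, so offset 1 is rejected while offsets 0 and 2 are accepted.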
