
How to index a String in Rust

I am attempting to index a string in Rust, but the compiler throws an error. My code (Project Euler problem 4, playground):

fn is_palindrome(num: u64) -> bool {
    let num_string = num.to_string();
    let num_length = num_string.len();

    for i in 0 .. num_length / 2 {
        if num_string[i] != num_string[(num_length - 1) - i] {
            return false;
        }
    }
    
    true
}

The error:

error[E0277]: the trait bound `std::string::String: std::ops::Index<usize>` is not satisfied
 --> <anon>:7:12
  |
7 |         if num_string[i] != num_string[(num_length - 1) - i] {
  |            ^^^^^^^^^^^^^
  |
  = note: the type `std::string::String` cannot be indexed by `usize`

Is there a reason why String cannot be indexed? How can I access the data then?

Yes, indexing into a string is not available in Rust. The reason is that Rust strings are encoded in UTF-8 internally, so the concept of indexing itself would be ambiguous, and people would misuse it. Byte indexing is fast but almost always incorrect: when your text contains non-ASCII symbols, a byte index may land in the middle of a character, which is really bad if you need text processing. Char indexing is not free, because UTF-8 is a variable-length encoding, so you have to traverse the string from the start up to the required code point.
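
To make the byte-versus-character distinction concrete, here is a minimal sketch using only the standard library. The helper names byte_and_char_len and is_safe_split_point are hypothetical and exist only for illustration.

```rust
/// Returns (length in bytes, number of `char`s) for a string slice.
/// `len()` is O(1); `chars().count()` walks the whole string.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Returns true when `index` falls on a UTF-8 character boundary of `s`,
/// i.e. when it would be safe to split or slice the string there.
fn is_safe_split_point(s: &str, index: usize) -> bool {
    s.is_char_boundary(index)
}
```

For "héllo" the two lengths differ (6 bytes, 5 chars), and byte index 2 falls inside the two-byte encoding of 'é', so it is not a valid split point.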

If you are certain that your strings contain ASCII characters only, you can use the as_bytes() method on &str, which returns a byte slice, and then index into this slice:

let num_string = num.to_string();

// ...

let b: u8 = num_string.as_bytes()[i];
let c: char = b as char;  // if you need to get the character as a unicode code point

If you do need to index code points, you have to use the chars() iterator:

num_string.chars().nth(i).unwrap()

As I said above, this requires traversing the iterator up to the i-th code point.

Finally, in many cases of text processing, it is actually necessary to work with grapheme clusters rather than with code points or bytes. With the help of the unicode-segmentation crate, you can index into grapheme clusters as well:

use unicode_segmentation::UnicodeSegmentation;

let string: String = ...;
UnicodeSegmentation::graphemes(&string, true).nth(i).unwrap()

Naturally, grapheme cluster indexing has the same cost as indexing into code points: the string has to be traversed from the start.
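
A small standard-library sketch of why code points and grapheme clusters can differ. It does not perform grapheme segmentation (that still needs a crate such as unicode-segmentation); it only lists the code points. The helper name code_points is hypothetical.

```rust
/// Collects the Unicode scalar values (`char`s) of a string slice.
/// Note: this is NOT grapheme segmentation; a single visible letter
/// may be made up of several `char`s.
fn code_points(s: &str) -> Vec<char> {
    s.chars().collect()
}
```

The string "e\u{301}" (the letter e followed by a combining acute accent) is displayed as one letter, yet it consists of two code points and three bytes, while the precomposed "\u{e9}" is a single code point.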

The correct approach to doing this sort of thing in Rust is not indexing but iteration. The main problem here is that Rust's strings are encoded in UTF-8, a variable-length encoding for Unicode characters. Because the encoding is variable in length, the memory position of the nth character can't be determined without looking at the string. This also means that accessing the nth character has a runtime of O(n)!

In this special case, you can iterate over the bytes, because your string is known to only contain the characters 0–9 (iterating over the characters is the more general solution but is a little less efficient).

Here is some idiomatic code to achieve this (playground):

fn is_palindrome(num: u64) -> bool {
    let num_string = num.to_string();
    let half = num_string.len() / 2;

    num_string.bytes().take(half).eq(num_string.bytes().rev().take(half))
}

We go through the bytes in the string both forwards (num_string.bytes().take(half)) and backwards (num_string.bytes().rev().take(half)) simultaneously; the .take(half) part is there to halve the amount of work done. We then simply compare one iterator to the other to ensure at each step that the nth and nth-last bytes are equal; if they all are, it returns true; if not, false.
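
For strings that may contain non-ASCII text, the same idea can be sketched with chars() instead of bytes(). This is the "more general solution" mentioned above, written under the assumption that comparing code points (not grapheme clusters) is good enough; the function name is_char_palindrome is hypothetical.

```rust
/// Checks whether `s` reads the same forwards and backwards,
/// comparing Unicode scalar values rather than bytes.
fn is_char_palindrome(s: &str) -> bool {
    // Counting chars requires one pass over the string.
    let half = s.chars().count() / 2;
    // Compare the first half (forwards) with the last half (backwards).
    s.chars().take(half).eq(s.chars().rev().take(half))
}
```

The extra chars().count() pass is the price for not knowing the character count up front; for digit-only strings the bytes() version avoids it.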

If what you are looking for is something similar to an index, you can use .chars() and .nth() on a string.


.chars() -> Returns an iterator over the chars of a string slice.

.nth() -> Returns the nth element of the iterator, in an Option.


Now you can use the above in several ways, for example:

let s: String = String::from("abc");
// If you are sure the index is in range
println!("{}", s.chars().nth(x).unwrap());
// or, if not, panic with a descriptive message
println!("{}", s.chars().nth(x).expect("message"));
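
Both unwrap() and expect() panic when the index is out of range. If that is not acceptable, the Option can be handled explicitly. A minimal sketch follows; the function name describe_char and the message texts are made up for illustration.

```rust
/// Describes the `x`-th char of `s` without panicking on a bad index.
fn describe_char(s: &str, x: usize) -> String {
    match s.chars().nth(x) {
        // The iterator had at least x + 1 elements.
        Some(c) => format!("char {} is '{}'", x, c),
        // The string is shorter than x + 1 chars.
        None => format!("index {} is out of range", x),
    }
}
```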

You can convert a String or &str to a Vec of chars and then index that Vec.

For example:

fn main() {
    let s = "Hello world!";
    let my_vec: Vec<char> = s.chars().collect();
    println!("my_vec[0]: {}", my_vec[0]);
    println!("my_vec[1]: {}", my_vec[1]);
}

Here you have a live example
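
Applied to the question's palindrome check, the Vec<char> approach might look like the sketch below. Collecting costs one O(n) pass and four bytes per character, after which every index access is O(1). This is one possible arrangement, not the only one.

```rust
/// Palindrome check that keeps the question's index-based logic
/// by first collecting the digits into a Vec<char>.
fn is_palindrome(num: u64) -> bool {
    let chars: Vec<char> = num.to_string().chars().collect();
    let len = chars.len();
    // Compare the i-th char from the front with the i-th from the back.
    (0..len / 2).all(|i| chars[i] == chars[len - 1 - i])
}
```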

Indexing on String is not allowed because (please check the book):

  • it is not clear what the indexed value should be: a byte, a character, or a grapheme cluster (which is what we call a letter in the everyday sense)
  • strings are vectors of bytes (u8) encoded with UTF-8, and UTF-8 is a variable-length encoding, i.e. every character can take a different number of bytes, from 1 to 4. So getting a character or grapheme cluster by index requires traversing the string from the beginning (O(n) in the average and worst cases) to determine the valid byte bounds of the character or grapheme.

So if your input doesn't contain diacritics (which are considered separate characters) and it's OK to approximate a letter with a character, you can use the chars() iterator and the DoubleEndedIterator trait for a two-pointer approach:

    fn is_palindrome(num: u64) -> bool {
        let s = num.to_string();
        let mut iterator = s.chars();
        loop  {
            let ch = iterator.next();
            let ch_end = iterator.next_back();
            
            if ch.is_none() || ch_end.is_none() {
                break;
            }
            if ch.unwrap() != ch_end.unwrap() {
                return false
            }
        }
        true
    }
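
The same two-pointer loop can be written a little more compactly with while let, which removes the explicit is_none() checks and unwrap() calls. This is a sketch of an equivalent formulation that takes a &str so it can be tried on arbitrary text; the name is_palindrome_str is hypothetical.

```rust
/// Two-pointer palindrome check driven by a double-ended iterator.
fn is_palindrome_str(s: &str) -> bool {
    let mut it = s.chars();
    // `next()` advances from the front, `next_back()` from the back;
    // the loop ends as soon as either side runs out of characters.
    while let (Some(front), Some(back)) = (it.next(), it.next_back()) {
        if front != back {
            return false;
        }
    }
    true
}
```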

This is not suitable for all uses by any means, but if you just need to reference the previous character (or, with a little rework, the next character), then it's possible to do so without iterating through the entire str.

The scenario here is that there is a str slice, string, and a pattern was found in the slice. I want to know the character immediately before the pattern.

Call prev_char like prev_char(string.as_bytes(), pattern_index), where pattern_index is the index of the first byte of the pattern in string.

UTF-8 encoding is well defined, and this works just by backing up until it finds one of the starting bytes (either high-order bit 0 or high-order bits 11) and then converting that 1-4 byte [u8] slice to a str.

This code just unwraps it because the pattern was found in a valid UTF-8 str to begin with, so no error is possible. If your data has not been validated, it might be best to return a Result rather than an Option.

enum PrevCharStates {
    Start,
    InEncoding,
}

fn prev_char(bytes: &[u8], starting_index: usize) -> Option<&str> {
    let mut ix = starting_index;
    let mut state = PrevCharStates::Start;

    while ix > 0 {
        ix -= 1;
        let byte = bytes[ix];
        match state {
            PrevCharStates::Start => {
                if byte & 0b10000000 == 0 {
                    return Some(std::str::from_utf8(&bytes[ix..starting_index]).unwrap());
                } else if byte & 0b11000000 == 0b10000000 {
                    state = PrevCharStates::InEncoding;
                }
            },
            PrevCharStates::InEncoding => {
                if byte & 0b11000000 == 0b11000000 {
                    return Some(std::str::from_utf8(&bytes[ix..starting_index]).unwrap());
                } else if byte & 0b11000000 != 0b10000000 {
                    return None;
                }
            }
        }
    }
    None
}
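
When the data is already a valid &str, a similar result can be sketched with the standard library alone: slice up to the pattern's byte index and take the last char of that prefix. The reverse iterator only decodes one character from the end, so it does not walk the whole prefix. The name prev_char_std is hypothetical, and unlike prev_char above it returns a char rather than a &str.

```rust
/// Returns the character immediately before byte offset `pattern_index`,
/// or None if the offset is 0, out of range, or not on a char boundary.
fn prev_char_std(s: &str, pattern_index: usize) -> Option<char> {
    // `get` returns None instead of panicking on an invalid byte range.
    s.get(..pattern_index)?.chars().next_back()
}
```

For example, in "café!" the '!' starts at byte 5, and the character before it is 'é'.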

The code below works fine. I am not sure about its performance and big-O complexity, and hopefully someone can add more information about this solution.

fn is_palindrome(num: u64) -> bool {
    let num_string = num.to_string();
    let num_length = num_string.len();
    for i in 0..num_length / 2 {
        let left = &num_string[i..i + 1];
        let right = &num_string[((num_length - 1) - i)..num_length - i];
        if left != right {
            return false;
        }
    }
    true
}
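
On the complexity question: slicing a str by a byte range is a constant-time operation, so the loop above is O(n) overall. The catch is that the range must fall on character boundaries, otherwise the slice expression panics. That cannot happen for digit-only strings, but for arbitrary text a non-panicking variant can be sketched with str::get. The helper name byte_slice is hypothetical.

```rust
/// Returns the substring covering bytes `start..end`, or None when the
/// range is out of bounds or cuts through a multi-byte character.
fn byte_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```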

There are two reasons indexing does not work on strings in Rust:

  • In Rust, strings are stored as a collection of UTF-8 encoded bytes. In memory, strings are just collections of 1s and 0s. A program needs to be able to interpret those 1s and 0s and print out the correct characters. That's where encoding comes into play.

     fn main() {
         let sample: String = String::from("2bytesPerChar");
         // We could do this in higher-level programming languages.
         // In Rust we get an error: String cannot be indexed by an integer.
         let c: char = sample[0];
     }

String is a collection of bytes. So what is the length of our "2bytesPerChar"? It depends, because a char can be 1 to 4 bytes long. Assume that the first character takes 2 bytes. If you want to get the first char in the string using indexing, sample[0] would specify the first byte, which is only half of the first character.

  • Another reason is that there are 3 relevant ways a word is represented in Unicode: bytes, scalar values, and grapheme clusters. If we use indexing, Rust does not know which we expect to receive: a byte, a scalar value, or a grapheme cluster. So we have to use more specific methods.
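
To see how byte positions and characters relate, the standard char_indices() iterator yields each char together with the byte offset where it starts. A minimal sketch; the helper name char_offsets is hypothetical.

```rust
/// Pairs every char of `s` with the byte offset at which it begins.
fn char_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}
```

In "aéb" the offsets are 0, 1 and 3: 'é' occupies two bytes, so 'b' starts at byte 3 even though it is the third character.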

How to access the characters in String

  • Return bytes:

     for b in "dsfsd".bytes() {
         // the bytes method returns an iterator over the bytes;
         // here we iterate over every byte and print it out
         println!("{}", b)
     }
  • Return scalar values:

   // we can iterate over scalar values using the chars method
   for c in "kjdskj".chars(){
       println!("{}",c)
   }
  • Return grapheme clusters:

In order to keep the Rust standard library lean, the ability to iterate over grapheme clusters is not included by default. We need to import a crate:

# in Cargo.toml
   [dependencies]
   unicode-segmentation="1.7.1"

then:

   use unicode_segmentation::UnicodeSegmentation;
   // we pass true to get extended grapheme clusters
   for g in "dada".graphemes(true) {
       println!("{}",g)
   }
