How to index a String in Rust
I am attempting to index a string in Rust, but the compiler throws an error. My code (Project Euler problem 4, playground):
fn is_palindrome(num: u64) -> bool {
    let num_string = num.to_string();
    let num_length = num_string.len();
    for i in 0 .. num_length / 2 {
        if num_string[i] != num_string[(num_length - 1) - i] {
            return false;
        }
    }
    true
}
The error:
error[E0277]: the trait bound `std::string::String: std::ops::Index<usize>` is not satisfied
--> <anon>:7:12
|
7 | if num_string[i] != num_string[(num_length - 1) - i] {
| ^^^^^^^^^^^^^
|
= note: the type `std::string::String` cannot be indexed by `usize`
Is there a reason why String cannot be indexed? How can I access the data then?
Yes, indexing into a string is not available in Rust. The reason for this is that Rust strings are encoded in UTF-8 internally, so the concept of indexing itself would be ambiguous, and people would misuse it: byte indexing is fast, but almost always incorrect (when your text contains non-ASCII symbols, byte indexing may leave you inside a character, which is really bad if you need text processing), while char indexing is not free because UTF-8 is a variable-length encoding, so you have to traverse the string up to the required code point.
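To make the ambiguity concrete, here is a small sketch using only the standard library. The helper names byte_and_char_len and splits_a_char are made up for this illustration; len(), chars() and is_char_boundary() are the real str methods.

```rust
/// Returns (length in bytes, number of `char`s) for a string.
/// The two differ as soon as the text contains a non-ASCII character.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Returns true if byte offset `i` lands in the middle of a multi-byte
/// character, i.e. a naive byte index would point "inside" a character.
fn splits_a_char(s: &str, i: usize) -> bool {
    i < s.len() && !s.is_char_boundary(i)
}
```

For "h\u{e9}llo" (the word "héllo" with a precomposed é), the byte length is 6 while the char count is 5, and byte offset 2 falls inside the é.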
If you are certain that your strings contain ASCII characters only, you can use the as_bytes() method on &str, which returns a byte slice, and then index into this slice:
let num_string = num.to_string();
// ...
let b: u8 = num_string.as_bytes()[i];
let c: char = b as char; // if you need to get the character as a unicode code point
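If you are not fully certain about the ASCII-only assumption, one option is to check it at runtime. This is only a sketch: ascii_char_at is a hypothetical helper name, and the is_ascii() check scans the whole string, so it costs O(n) per call.

```rust
/// Returns the character at byte index `i`, but only if the whole string
/// is ASCII (so that one byte is guaranteed to be one character).
/// Returns None for non-ASCII input or an out-of-range index.
fn ascii_char_at(s: &str, i: usize) -> Option<char> {
    if !s.is_ascii() {
        return None;
    }
    // `get` avoids the panic that `as_bytes()[i]` would cause when out of range.
    s.as_bytes().get(i).map(|&b| b as char)
}
```

In a hot loop you would probably check is_ascii() once up front and then index the byte slice directly.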
If you do need to index code points, you have to use the chars() iterator:
num_string.chars().nth(i).unwrap()
As I said above, this would require traversing the iterator up to the i-th code point.
Finally, in many cases of text processing, it is actually necessary to work with grapheme clusters rather than with code points or bytes. With the help of the unicode-segmentation crate, you can index into grapheme clusters as well:
use unicode_segmentation::UnicodeSegmentation;
let string: String = ...;
UnicodeSegmentation::graphemes(&string, true).nth(i).unwrap()
Naturally, grapheme cluster indexing has the same requirement of traversing the entire string as indexing into code points.
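As a rough illustration of why code points are sometimes the wrong unit, the following standard-library-only sketch reverses a string char by char. The helper name reverse_by_chars is invented for this example; it does not use the unicode-segmentation crate, it only shows the problem that grapheme clusters solve.

```rust
/// Reverses a string code point by code point.
/// This is NOT grapheme-aware: a combining mark ends up in front of
/// its base letter, so the visible text changes.
fn reverse_by_chars(s: &str) -> String {
    s.chars().rev().collect()
}
```

With the input "ae\u{301}" (an "a" followed by "e" plus a combining acute accent), the result is "\u{301}ea": the accent is now detached from the "e", which a reader would not consider a correct reversal.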
The correct approach to doing this sort of thing in Rust is not indexing but iteration. The main problem here is that Rust's strings are encoded in UTF-8, a variable-length encoding for Unicode characters. Being variable in length, the memory position of the nth character can't be determined without looking at the string. This also means that accessing the nth character has a runtime of O(n)!
In this special case, you can iterate over the bytes, because your string is known to only contain the characters 0–9 (iterating over the characters is the more general solution but is a little less efficient).
Here is some idiomatic code to achieve this (playground):
fn is_palindrome(num: u64) -> bool {
    let num_string = num.to_string();
    let half = num_string.len() / 2;
    num_string.bytes().take(half).eq(num_string.bytes().rev().take(half))
}
We go through the bytes in the string both forwards (num_string.bytes().take(half)) and backwards (num_string.bytes().rev().take(half)) simultaneously; the .take(half) part is there to halve the amount of work done. We then simply compare one iterator to the other one to ensure at each step that the nth and nth-last bytes are equal; if they are, it returns true; if not, false.
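For arbitrary text rather than digits, the more general character-based variant mentioned above might look like the sketch below. The function names are made up for the comparison, and the char version is still not grapheme-aware.

```rust
/// Palindrome check over `char`s: works for any text made of
/// single-code-point characters, not just ASCII digits.
fn is_palindrome_chars(s: &str) -> bool {
    s.chars().eq(s.chars().rev())
}

/// The same check over raw bytes: only reliable for ASCII input,
/// because reversing bytes scrambles multi-byte characters.
fn is_palindrome_bytes(s: &str) -> bool {
    s.bytes().eq(s.bytes().rev())
}
```

For "\u{e9}t\u{e9}" (the French word "été" with precomposed accents), the char version reports a palindrome while the byte version does not, which is exactly the kind of mistake byte-level processing of non-ASCII text invites.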
If what you are looking for is something similar to an index, you can use .chars() and .nth() on a string.
.chars() -> Returns an iterator over the chars of a string slice.
.nth() -> Returns the nth element of the iterator, in an Option.
Now you can use the above in several ways, for example:
let s: String = String::from("abc");
let x: usize = 1; // example index
// If you are sure the index is in range
println!("{}", s.chars().nth(x).unwrap());
// or, if not, panic with a descriptive message
println!("{}", s.chars().nth(x).expect("message"));
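Both unwrap() and expect() panic when the index is out of range. If that should not crash the program, the Option can be handled explicitly. A minimal sketch, where describe_nth is just an illustrative name:

```rust
/// Looks up the x-th `char` of `s` and reports the result as text,
/// without panicking when `x` is past the end of the string.
fn describe_nth(s: &str, x: usize) -> String {
    match s.chars().nth(x) {
        Some(c) => format!("character {} is '{}'", x, c),
        None => format!("index {} is out of range", x),
    }
}
```

For example, describe_nth("abc", 1) gives "character 1 is 'b'", while describe_nth("abc", 3) gives "index 3 is out of range".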
You can convert a String or &str to a Vec of chars and then index that Vec.
For example:
fn main() {
    let s = "Hello world!";
    let my_vec: Vec<char> = s.chars().collect();
    println!("my_vec[0]: {}", my_vec[0]);
    println!("my_vec[1]: {}", my_vec[1]);
}
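Applied to the palindrome question, the Vec<char> approach could be sketched as follows. The name is_palindrome_vec is hypothetical. The trade-off is one O(n) pass and an extra allocation up front, after which every index operation is O(1).

```rust
/// Palindrome check that collects the characters once and then
/// uses ordinary O(1) indexing on the resulting Vec<char>.
fn is_palindrome_vec(s: &str) -> bool {
    let chars: Vec<char> = s.chars().collect();
    let n = chars.len();
    // Compare the i-th character from the front with the i-th from the back.
    (0..n / 2).all(|i| chars[i] == chars[n - 1 - i])
}
```

This is convenient when you need many random accesses into the same string; for a single pass, iterating directly avoids the allocation.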
Indexing on String is not allowed because (please check the book):
So if your input doesn't contain diacritics (which are considered separate characters) and it's OK to approximate letters with chars, you can use the chars() iterator and the DoubleEndedIterator trait for a two-pointer approach:
fn is_palindrome(num: u64) -> bool {
    let s = num.to_string();
    let mut iterator = s.chars();
    loop {
        let ch = iterator.next();
        let ch_end = iterator.next_back();
        if ch.is_none() || ch_end.is_none() {
            break;
        }
        if ch.unwrap() != ch_end.unwrap() {
            return false
        }
    }
    true
}
This is not suitable for all uses by any means, but if you just need to reference the previous character (or, with a little rework, the next character), then it's possible to do so without iterating through the entire str. The scenario here is that there is a str slice, string, and a pattern was found in the slice. I want to know the character immediately before the pattern.
Call prev_char like prev_char(string.as_bytes(), pattern_index), where pattern_index is the index of the first byte of the pattern in string.
UTF-8 encoding is well defined, and this works just by backing up until it finds one of the starting bytes (a byte whose high-order bit is 0, or whose two high-order bits are 11) and then converting that 1-4 byte [u8] slice to a str.
This code just unwraps it because the pattern was found in a valid UTF-8 str to begin with, so no error is possible. If your data has not been validated, it might be best to return a Result rather than an Option.
enum PrevCharStates {
    Start,
    InEncoding,
}
fn prev_char(bytes: &[u8], starting_index: usize) -> Option<&str> {
    let mut ix = starting_index;
    let mut state = PrevCharStates::Start;
    while ix > 0 {
        ix -= 1;
        let byte = bytes[ix];
        match state {
            PrevCharStates::Start => {
                if byte & 0b10000000 == 0 {
                    return Some(std::str::from_utf8(&bytes[ix..starting_index]).unwrap());
                } else if byte & 0b11000000 == 0b10000000 {
                    state = PrevCharStates::InEncoding;
                }
            },
            PrevCharStates::InEncoding => {
                if byte & 0b11000000 == 0b11000000 {
                    return Some(std::str::from_utf8(&bytes[ix..starting_index]).unwrap());
                } else if byte & 0b11000000 != 0b10000000 {
                    return None;
                }
            }
        }
    }
    None
}
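For comparison, when the data is already a &str (and therefore valid UTF-8), the standard library can do the backing-up for you. The following is an alternative sketch, not the code above; prev_char_std is a made-up name, while str::get and chars().next_back() are standard methods.

```rust
/// Returns the character immediately before byte offset `pattern_index`.
/// Returns None if the offset is 0, out of range, or not on a character
/// boundary, instead of panicking.
fn prev_char_std(s: &str, pattern_index: usize) -> Option<char> {
    // `get(..i)` yields None when `i` is out of range or splits a character.
    // `next_back()` decodes only the last character of the prefix, so this
    // does not walk the whole string.
    s.get(..pattern_index)?.chars().next_back()
}
```

A typical call would pair it with find, for example s.find("!").and_then(|i| prev_char_std(s, i)).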
The code below works fine. I'm not sure about its performance and big-O complexity; hopefully someone can add more information about this solution.
fn is_palindrome(num: u64) -> bool {
    let num_string = num.to_string();
    let num_length = num_string.len();
    for i in 0..num_length / 2 {
        let left = &num_string[i..i + 1];
        let right = &num_string[((num_length - 1) - i)..num_length - i];
        if left != right {
            return false;
        }
    }
    true
}
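On the complexity question: slicing a str by a byte range is a constant-time operation, so the loop as a whole is O(n). The catch is that range slicing panics if a range boundary falls inside a multi-byte character, which cannot happen for decimal digits but can for general text. A hedged variant for arbitrary &str input, using str::get so that the failure shows up as None (the function name is invented for this sketch):

```rust
/// Same byte-range comparison as above, but for arbitrary string input.
/// Returns None if any one-byte range would cut through a multi-byte
/// character, instead of panicking.
fn is_palindrome_slices(s: &str) -> Option<bool> {
    let n = s.len();
    for i in 0..n / 2 {
        let left = s.get(i..i + 1)?;
        let right = s.get(n - 1 - i..n - i)?;
        if left != right {
            return Some(false);
        }
    }
    Some(true)
}
```

For ASCII input this behaves like the original; for something like "\u{e9}t\u{e9}" it returns None rather than crashing.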
There are two reasons indexing is not working in Rust:
In Rust, strings are stored as a collection of UTF-8 encoded bytes. In memory, strings are just collections of 1's and 0's. A program needs to be able to interpret those 1's and 0's and print out the correct characters. That's where encoding comes into play.
fn main() {
    let sample: String = String::from("2bytesPerChar");
    // We could do this in higher-level programming languages.
    // In Rust we get an error: the type cannot be indexed by an integer.
    let c: char = sample[0];
}
A String is a collection of bytes, so what is the length of our "2bytesPerChar"? That is not obvious, because a char can be 1 to 4 bytes long. Assume that the first character takes 2 bytes. If you want to get the first char in the string using indexing, sample[0] will specify the first byte, which is only half of the first character.
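The "1 to 4 bytes" claim can be checked with char::len_utf8 from the standard library. A small sketch, with utf8_widths as an illustrative helper name:

```rust
/// Returns how many bytes each character of `s` occupies in UTF-8.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}
```

For the four characters 'a', 'é' (U+00E9), '€' (U+20AC) and '😀' (U+1F600) the widths are 1, 2, 3 and 4 bytes respectively, and their sum equals the string's len().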
Bytes, scalar values, grapheme clusters: if we use indexing, Rust does not know which of these we want to receive.
Return bytes:
for b in "dsfsd".bytes() {
    // The bytes method returns an iterator over the bytes;
    // here we iterate over every byte and print it out
    println!("{}", b);
}
Return scalar values:
// We can iterate over scalar values using the chars method
for c in "kjdskj".chars() {
    println!("{}", c);
}
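The byte view and the scalar-value view can also be combined: char_indices() yields each char together with the byte offset where it starts. A minimal sketch (char_offsets is a name chosen for this example):

```rust
/// Pairs every character with the byte offset at which it starts.
fn char_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}
```

For "a\u{e9}b" the offsets are 0, 1 and 3, because the two-byte 'é' occupies bytes 1 and 2. These offsets are always valid positions for slicing the original string.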
In order to keep the Rust standard library lean, the ability to iterate over grapheme clusters is not included by default. We need to import a crate:
# in Cargo.toml
[dependencies]
unicode-segmentation = "1.7.1"
Then:
use unicode_segmentation::UnicodeSegmentation;
// we pass true to get extended grapheme clusters
for g in "dada".graphemes(true) {
    println!("{}", g);
}