
Why is capitalizing the first letter of a string so convoluted in Rust?

I'd like to capitalize the first letter of a &str. It's a simple problem and I hope for a simple solution. Intuition tells me to do something like this:

let mut s = "foobar";
s[0] = s[0].to_uppercase();

But a &str can't be indexed like this. The only way I've been able to do it seems overly convoluted. I convert the &str to an iterator, collect the iterator into a vector, uppercase the first item in the vector, which creates an iterator, from which I take the first item, giving an Option, which I unwrap to get the uppercased first letter. Then I convert the vector into an iterator, which I collect into a String, which I borrow as a &str.

let s1 = "foobar";
let mut v: Vec<char> = s1.chars().collect();
v[0] = v[0].to_uppercase().nth(0).unwrap();
let s2: String = v.into_iter().collect();
let s3 = &s2;

Is there an easier way than this, and if so, what? If not, why is Rust designed this way?

Similar question

Why is it so convoluted?

Let's break it down, line by line:

let s1 = "foobar";

We've created a string literal that is encoded in UTF-8. UTF-8 allows us to encode the 1,114,112 code points of Unicode in a manner that's pretty compact if you come from a region of the world that types mostly characters found in ASCII, a standard created in 1963. UTF-8 is a variable-length encoding, which means that a single code point might take from 1 to 4 bytes. The shorter encodings are reserved for ASCII, but many Kanji take 3 bytes in UTF-8.
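To make the variable-length encoding concrete, here is a minimal sketch using only the standard library. The helper name utf8_widths is made up for illustration; char::len_utf8 is the standard method that reports how many bytes a character needs in UTF-8.

```rust
/// Pairs each character of `s` with the number of bytes it occupies
/// in UTF-8. (`utf8_widths` is a hypothetical helper, not a std API.)
fn utf8_widths(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf8())).collect()
}
```

Called on a string mixing an ASCII letter, ß, a Kanji and an emoji, this reports widths of 1, 2, 3 and 4 bytes respectively, which is why a byte offset cannot be treated as a character index.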

let mut v: Vec<char> = s1.chars().collect();

This creates a vector of characters (char). A char is a 32-bit number that directly maps to a code point. If we started with ASCII-only text, we've quadrupled our memory requirements. If we had a bunch of characters from the astral plane, then maybe we haven't used that much more.
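The memory claim can be checked with a rough sketch like the following. The function name storage_bytes is an assumption for illustration, and it only counts element storage, not the allocator overhead or the Vec header.

```rust
use std::mem::size_of;

/// Returns (bytes used by the UTF-8 text, bytes used by the same text
/// stored as a Vec<char>). Only element storage is counted.
/// (`storage_bytes` is a hypothetical helper.)
fn storage_bytes(s: &str) -> (usize, usize) {
    let as_chars: Vec<char> = s.chars().collect();
    (s.len(), as_chars.len() * size_of::<char>())
}
```

For "foobar" this gives 6 bytes as UTF-8 against 24 bytes as characters, while a string made only of 4-byte code points comes out the same size either way.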

v[0] = v[0].to_uppercase().nth(0).unwrap();

This grabs the first code point and requests that it be converted to an uppercase variant. Unfortunately for those of us who grew up speaking English, there's not always a simple one-to-one mapping of a "small letter" to a "big letter". Side note: we call them upper- and lower-case because one box of letters was above the other box of letters back in the day.

This code would panic if a code point had no corresponding uppercase variant. I'm not sure that can actually happen: the documentation for char::to_uppercase says the iterator yields the character unchanged when there is no uppercase mapping. It could also semantically fail when a code point has an uppercase variant that has multiple characters, such as the German ß. Note that ß may never actually be capitalized in The Real World; this is just the example I can always remember and search for. As of 2017-06-29, in fact, the official rules of German spelling have been updated so that both "ẞ" and "SS" are valid capitalizations!
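A small sketch of the difference between the full mapping and what the question's code keeps. The two function names are made up for illustration; char::to_uppercase is the standard method and returns an iterator of characters.

```rust
/// Full uppercase mapping of one character, collected into a String.
/// (`full_upper` is a hypothetical helper.)
fn full_upper(c: char) -> String {
    c.to_uppercase().collect()
}

/// Keeps only the first character of the mapping, which is what
/// `nth(0).unwrap()` does in the question's code. Falls back to `c`
/// rather than panicking if the iterator were ever empty.
fn first_upper(c: char) -> char {
    c.to_uppercase().next().unwrap_or(c)
}
```

With ß as input, full_upper returns the two-character string "SS", while first_upper silently drops the second S. A character without an uppercase mapping, such as a digit, comes back unchanged from both.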

let s2: String = v.into_iter().collect();

Here we convert the characters back into UTF-8 and require a new allocation to store them in, as the original variable was stored in constant memory so as to not take up memory at run time.

let s3 = &s2;

And now we take a reference to that String.

"It's a simple problem"

Unfortunately, this is not true. Perhaps we should endeavor to convert the world to Esperanto?

"I presume char::to_uppercase already properly handles Unicode."

Yes, I certainly hope so. Unfortunately, Unicode isn't enough in all cases. Thanks to huon for pointing out the Turkish I, where both the upper (İ) and lower case (i) versions have a dot. That is, there is no one proper capitalization of the letter i; it depends on the locale of the source text as well.
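As a sketch of why the locale matters: the standard library's case mappings take no locale argument, so they cannot produce the Turkish result. The helper names upper and lower below are assumptions for illustration; they only wrap char::to_uppercase and char::to_lowercase.

```rust
/// Locale-independent uppercase mapping from the standard library.
/// (`upper` and `lower` are hypothetical wrappers.)
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

/// Locale-independent lowercase mapping from the standard library.
fn lower(c: char) -> String {
    c.to_lowercase().collect()
}
```

Here upper('i') is always "I", whereas Turkish text would expect "İ". In the other direction, lower('İ') yields an "i" followed by a combining dot above (U+0307) instead of a plain "i".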

"why the need for all data type conversions?"

Because the data types you are working with are important when you are worried about correctness and performance. A char is 32 bits and a string is UTF-8 encoded. They are different things.

"indexing could return a multi-byte, Unicode character"

There may be some mismatched terminology here. A char is a multi-byte Unicode character.

Slicing a string is possible if you go byte-by-byte, but the standard library will panic if you are not on a character boundary.
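If a panic is not acceptable, str::get is the checked counterpart of slicing. A minimal sketch follows; the wrapper name byte_prefix is made up for illustration.

```rust
/// Returns the first `n` bytes of `s` as a string slice, or `None`
/// when `n` is past the end or falls inside a multi-byte character.
/// (`byte_prefix` is a hypothetical helper around `str::get`.)
fn byte_prefix(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}
```

For "édouard", asking for 1 byte returns None because the é takes two bytes, while asking for 2 bytes returns the é. Writing &s[..1] on the same string would panic instead.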

One of the reasons that indexing a string to get a character was never implemented is that so many people misuse strings as arrays of ASCII characters. Indexing a string to set a character could never be efficient: you'd have to be able to replace 1-4 bytes with a value that is also 1-4 bytes, causing the rest of the string to bounce around quite a lot.
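To illustrate the cost, here is a sketch of what "setting" the first character involves. The function name replace_first_char is an assumption for illustration; String::replace_range and char::encode_utf8 are standard library methods.

```rust
/// Replaces the first character of `s` with `new`, in place, and returns
/// the change in byte length. (`replace_first_char` is a made-up name.)
fn replace_first_char(s: &mut String, new: char) -> isize {
    // Byte width of the current first character (0 for an empty string).
    let old_width = s.chars().next().map_or(0, |c| c.len_utf8());
    let before = s.len() as isize;
    let mut buf = [0u8; 4];
    // When the widths differ, every byte after the first character moves.
    s.replace_range(..old_width, new.encode_utf8(&mut buf));
    s.len() as isize - before
}
```

Replacing the "a" in "abc" with "é" grows the string by one byte, so the trailing "bc" has to shift. An index-assignment operator would hide that linear-time work behind what looks like a constant-time operation.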

"to_uppercase could return an upper case character"

As mentioned above, ß is a single character that, when capitalized, becomes two characters.

Solutions

See also trentcl's answer, which only uppercases ASCII characters.

Original

If I had to write the code, it'd look like:

fn some_kind_of_uppercase_first_letter(s: &str) -> String {
    let mut c = s.chars();
    match c.next() {
        None => String::new(),
        Some(f) => f.to_uppercase().chain(c).collect(),
    }
}

fn main() {
    println!("{}", some_kind_of_uppercase_first_letter("joe"));
    println!("{}", some_kind_of_uppercase_first_letter("jill"));
    println!("{}", some_kind_of_uppercase_first_letter("von Hagen"));
    println!("{}", some_kind_of_uppercase_first_letter("ß"));
}

But I'd probably search for uppercase or unicode on crates.io and let someone smarter than me handle it.

Improved

Speaking of "someone smarter than me", Veedrac points out that it's probably more efficient to convert the iterator back into a slice after the first capital code points are accessed. This allows for a memcpy of the rest of the bytes.

fn some_kind_of_uppercase_first_letter(s: &str) -> String {
    let mut c = s.chars();
    match c.next() {
        None => String::new(),
        Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
    }
}

"Is there an easier way than this, and if so, what? If not, why is Rust designed this way?"

Well, yes and no. Your code is, as the other answer pointed out, not correct, and will panic if you give it something like བོད་སྐད་ལ་. So doing this with Rust's standard library is even harder than you initially thought.

However, Rust is designed to encourage code reuse and make bringing in libraries easy. So the idiomatic way to capitalize a string is actually quite palatable:

extern crate inflector;
use inflector::Inflector;

let capitalized = "some string".to_title_case();

It's not especially convoluted if you are able to limit your input to ASCII-only strings.

Since Rust 1.23, str has a make_ascii_uppercase method (in older Rust versions, it was available through the AsciiExt trait). This means you can uppercase ASCII-only string slices with relative ease:

fn make_ascii_titlecase(s: &mut str) {
    if let Some(r) = s.get_mut(0..1) {
        r.make_ascii_uppercase();
    }
}

This will turn "taylor" into "Taylor", but it won't turn "édouard" into "Édouard".

Use with caution.
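Because the function takes a &mut str, a caller that starts from a string literal or a shared &str first needs an owned, mutable copy. One possible way to do that is sketched below; the wrapper name ascii_titlecased is an assumption, and make_ascii_titlecase is repeated so the sketch compiles on its own.

```rust
/// Same function as in the answer, repeated for self-containment.
fn make_ascii_titlecase(s: &mut str) {
    // `get_mut(0..1)` is `None` for an empty string or when byte 1 is
    // not a character boundary, so non-ASCII first letters are skipped.
    if let Some(r) = s.get_mut(0..1) {
        r.make_ascii_uppercase();
    }
}

/// Hypothetical convenience wrapper: copy the input, then capitalize
/// the copy in place.
fn ascii_titlecased(s: &str) -> String {
    let mut owned = s.to_owned();
    make_ascii_titlecase(&mut owned);
    owned
}
```

The copy costs one allocation. If the caller already owns a String, passing it as &mut directly avoids that.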

I did it this way:

fn str_cap(s: &str) -> String {
    // Panics if `s` is empty or its first character is not a single byte (ASCII).
    format!("{}{}", s[..1].to_uppercase(), &s[1..])
}

If it is not an ASCII string:

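fn str_cap(s: &str) -> String {
    match s.chars().next() {
        // Uppercase the first char, then append everything after it.
        Some(first) => format!("{}{}", first.to_uppercase(), &s[first.len_utf8()..]),
        None => String::new(), // empty input: nothing to capitalize
    }
}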

I agree with the asker. So I did it my own way:

fn capitalize(word: &str) -> String {
    // Note: `split_at(1)` panics if `word` is empty or starts with a multi-byte character.
    let mut output = String::with_capacity(word.len());
    let (first, last) = word.split_at(1);
    output.push_str(&first.to_uppercase());
    output.push_str(last);
    output
}

fn main() {
    let input = "end";
    let ret = capitalize(input);
    println!("{} -> {}", input, ret);
}

Here's a version that is a bit slower than @Shepmaster's improved version, but also more idiomatic:

fn capitalize_first(s: &str) -> String {
    let mut chars = s.chars();
    chars
        .next()
        .map(|first_letter| first_letter.to_uppercase())
        .into_iter()
        .flatten()
        .chain(chars)
        .collect()
}

This is how I solved this problem. Note that I had to check that self is ASCII before converting to uppercase; non-ASCII input is returned unchanged.

trait TitleCase {
    fn title(&self) -> String;
}

impl TitleCase for &str {
    fn title(&self) -> String {
        if !self.is_ascii() || self.is_empty() {
            return String::from(*self);
        }
        let (head, tail) = self.split_at(1);
        head.to_uppercase() + tail
    }
}

pub fn main() {
    println!("{}", "bruno".title());
    println!("{}", "b".title());
    println!("{}", "🦀".title());
    println!("{}", "ß".title());
    println!("{}", "".title());
    println!("{}", "བོད་སྐད་ལ".title());
}

Output

Bruno
B
🦀
ß

བོད་སྐད་ལ 

The OP's approach taken further: replace the first character with its uppercase representation.

let mut s = "foobar".to_string();
for i in 1..=4 { // a UTF-8 character is 1 to 4 bytes long, so offset 4 must be checked too
    if s.is_char_boundary(i) {
        let u = &s[0..i].to_uppercase();
        s.replace_range(..i, u);
        break;
    }
}
println!("{}", s);

There is no need to check whether the string s is empty, because is_char_boundary doesn't panic if index i is greater than s.len(); it simply returns false.
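Wrapped in a function, the same idea might look like the sketch below. The name uppercase_first_in_place is made up, and the range is written as 1..=4 on the assumption that 4-byte first characters should be handled too.

```rust
/// Uppercases the first character of `s` in place by probing byte
/// offsets 1 through 4 for the end of the first character.
/// (`uppercase_first_in_place` is a hypothetical name.)
fn uppercase_first_in_place(s: &mut String) {
    for i in 1..=4 {
        // `is_char_boundary` returns false, rather than panicking, for
        // any index past the end, so an empty string falls through.
        if s.is_char_boundary(i) {
            let upper = s[..i].to_uppercase();
            s.replace_range(..i, &upper);
            break;
        }
    }
}
```

An empty string is left untouched, "ß" becomes "SS", and a first character that occupies four bytes is found on the last iteration.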

Inspired by the get_mut examples, I wrote something like this:

fn make_capital(in_str: &str) -> String {
    let mut v = String::from(in_str);
    // `get_mut` yields `None` for empty input or a multi-byte first character.
    if let Some(first) = v.get_mut(0..1) {
        first.make_ascii_uppercase();
    }
    v
}

Since the method to_uppercase() returns a new string, you should be able to just add the remainder of the string like so.

This was tested with Rust 1.57+ but is likely to work in any version that supports slicing.

fn uppercase_first_letter(s: &str) -> String {
    // Panics if `s` is empty or does not start with a single-byte (ASCII) character.
    s[0..1].to_uppercase() + &s[1..]
}
