
Capitalize the first letter of each word in a string with different separators using a Java 8 stream

I need to capitalize the first letter of every word in a string, but it's not as easy as it seems: a word is any sequence of letters, digits, "_", "-", and "`", while all other characters are separators, i.e. the next letter after them must be capitalized.

Example of what the program should do:

For input: "#he&llo wo!r^ld"

Output should be: "#He&Llo Wo!R^Ld"

There are questions here that sound similar, but their solutions really don't help. This one, for example:

String output = Arrays.stream(input.split("[\\s&]+"))
                    .map(t -> t.substring(0, 1).toUpperCase() + t.substring(1))
                    .collect(Collectors.joining(" "));

As my task allows various separators, this solution doesn't work.
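To make the problem concrete, here is a minimal sketch that wraps the quoted snippet in a method and runs it on the sample input. The class and method names (NaiveSplitDemo, naive) are made up for illustration.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class NaiveSplitDemo {
    // The snippet quoted above, wrapped in a method (hypothetical name).
    static String naive(String input) {
        return Arrays.stream(input.split("[\\s&]+"))
                .map(t -> t.substring(0, 1).toUpperCase() + t.substring(1))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // '&' is consumed by split and replaced with a space, and the
        // letters after '#', '!' and '^' are never capitalized.
        System.out.println(naive("#he&llo wo!r^ld")); // prints "#he Llo Wo!r^ld"
    }
}
```

The original separators are lost and only whitespace and "&" are recognized, which is why this approach does not fit the task.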

It is possible to split a string and keep the delimiters, so taking into account the requirement for delimiters:

word is considered to be any sequence of letters, digits, "_", "-", "`" while all other chars are considered to be separators

the pattern which keeps the delimiters in the result array would be: "((?<=[^-`\\w])|(?=[^-`\\w]))":

[^-`\\w]: all characters except -, the backtick, and the word characters \w, i.e. [A-Za-z0-9_]
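As a rough illustration of what this pattern does, the sketch below prints the tokens that split produces for the sample input. The class and method names are hypothetical, and it assumes Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading element.

```java
import java.util.Arrays;

class SplitKeepDelimitersDemo {
    // Zero-width boundary: just after or just before any separator character.
    static final String BOUNDARY = "((?<=[^-`\\w])|(?=[^-`\\w]))";

    static String[] tokens(String input) {
        return input.split(BOUNDARY);
    }

    public static void main(String[] args) {
        // Words and separators come back as separate array elements.
        System.out.println(Arrays.toString(tokens("#he&llo wo!r^ld")));
        // prints [#, he, &, llo,  , wo, !, r, ^, ld]
    }
}
```

Because nothing is consumed by the lookarounds, joining the tokens with an empty string reproduces the input exactly.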

Then the "words" are capitalized, and the delimiters are kept as is:

static String capitalize(String input) {
    if (null == input || 0 == input.length()) {
        return input;
    }
    return Arrays.stream(input.split("((?<=[^-`\\w])|(?=[^-`\\w]))"))
                 .map(s -> s.matches("[-`\\w]+") ? Character.toUpperCase(s.charAt(0)) + s.substring(1) : s)
                 .collect(Collectors.joining(""));
}

Tests:

System.out.println(capitalize("#he&l_lo-wo!r^ld"));
System.out.println(capitalize("#`he`&l+lo wo!r^ld"));

Output:

#He&L_lo-wo!R^Ld
#`he`&L+Lo Wo!R^Ld
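String.split and String.matches compile their regular expression on every call. A possible variation, sketched here under the assumption that the method is called often, precompiles both patterns and uses Pattern.splitAsStream. The class name PrecompiledCapitalizer is made up for this example.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class PrecompiledCapitalizer {
    // Same boundaries and word definition as above, compiled once.
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<=[^-`\\w])|(?=[^-`\\w])");
    private static final Pattern WORD = Pattern.compile("[-`\\w]+");

    static String capitalize(String input) {
        if (input == null || input.isEmpty()) {
            return input;
        }
        return BOUNDARY.splitAsStream(input)
                // Only word tokens are changed; separators (and any empty
                // token) pass through untouched.
                .map(s -> WORD.matcher(s).matches()
                        ? Character.toUpperCase(s.charAt(0)) + s.substring(1)
                        : s)
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.println(capitalize("#he&l_lo-wo!r^ld"));   // #He&L_lo-wo!R^Ld
        System.out.println(capitalize("#`he`&l+lo wo!r^ld")); // #`he`&L+Lo Wo!R^Ld
    }
}
```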

Update
If you need to process not only ASCII characters but also other alphabets or character sets (e.g. Cyrillic, Greek), the class \\p{IsWord} may be used, and matching of Unicode characters needs to be enabled with the pattern flag (?U):

static String capitalizeUnicode(String input) {
    if (null == input || 0 == input.length()) {
        return input;
    }
    
    return Arrays.stream(input.split("(?U)((?<=[^-`\\p{IsWord}])|(?=[^-`\\p{IsWord}]))"))
                 .map(s -> s.matches("(?U)[-`\\p{IsWord}]+") ? Character.toUpperCase(s.charAt(0)) + s.substring(1) : s)
                 .collect(Collectors.joining(""));
}

Test:

System.out.println(capitalizeUnicode("#he&l_lo-wo!r^ld"));
System.out.println(capitalizeUnicode("#привет&`ёж`+дос^βιδ/ως"));

Output:

#He&L_lo-wo!R^Ld
#Привет&`ёж`+Дос^Βιδ/Ως
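The effect of the (?U) flag can be checked in isolation. The sketch below uses the equivalent constant Pattern.UNICODE_CHARACTER_CLASS with plain \w. The class and method names are illustrative only, and the Cyrillic test word is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Default \w covers only [A-Za-z0-9_].
    private static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    // With UNICODE_CHARACTER_CLASS (the (?U) flag), \w follows Unicode rules.
    private static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean isAsciiWord(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    static boolean isUnicodeWord(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String privet = "\u043f\u0440\u0438\u0432\u0435\u0442"; // "привет"
        System.out.println(isAsciiWord(privet));   // false
        System.out.println(isUnicodeWord(privet)); // true
        System.out.println(isUnicodeWord("he_l1")); // true
    }
}
```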

You can't use split that easily: split eliminates the separators and gives you only the things in between. As you need the separators, no can do.

One real dirty trick is to use something called 'lookahead'. The argument you pass to split is a regular expression. Most 'characters' in a regexp consume the input they match. If you do input.split("\\s+"), that doesn't 'just' split on whitespace, it also consumes it: the whitespace is no longer part of the individual entries in your string array.

However, consider ^ and $, or \\b. These still match things but don't consume anything. You don't consume 'end of string'. In fact, ^^^hello$$$ matches the string "hello" just as well. You can do this yourself, using lookahead: it matches when the lookahead is there but does not consume it:

String[] args = "Hello World$Huh   Weird".split("(?=[\\s_$-]+)");
for (String arg : args) System.out.println("*" + arg + "*");

Unfortunately, this 'works', in that it saves your separators, but it doesn't get you all that much closer to a solution:

*Hello*
* World*
*$Huh*
* *
* *
* Weird*

You can go with lookbehind as well, but it's limited; in Java, for example, a lookbehind cannot contain a pattern of unbounded length.
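For comparison, here is a small sketch of the lookbehind counterpart, using a single-character class so the length restriction does not come into play. The class and method names are hypothetical. With lookbehind, each separator stays attached to the end of the preceding piece instead of the start of the next one.

```java
class LookbehindSplitDemo {
    // Split at every position that is immediately preceded by a separator.
    static String[] splitAfterSeparators(String input) {
        return input.split("(?<=[\\s_$-])");
    }

    public static void main(String[] args) {
        for (String part : splitAfterSeparators("Hello World$Huh   Weird")) {
            System.out.println("*" + part + "*");
        }
        // prints:
        // *Hello *
        // *World$*
        // *Huh *
        // * *
        // * *
        // *Weird*
    }
}
```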

The conclusion should rapidly become: actually, doing this with split is a mistake.

Then, once split is off the table, you should no longer use streams either. Streams don't do well once you need to know something about the previous element to do the job: a stream of characters doesn't work, because you need to know whether the previous character was a non-letter.
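To illustrate the point, the sketch below shows roughly what a stream version without split ends up looking like: it has to stream over indices rather than characters so that each step can peek at the previous character. The names IndexStreamCapitalizer and isWordChar are made up, the word definition is taken from the question, and the code works on char values, so it only handles characters in the Basic Multilingual Plane.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class IndexStreamCapitalizer {
    // Word characters as defined in the question.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '`';
    }

    static String capitalize(String s) {
        return IntStream.range(0, s.length())
                .mapToObj(i -> {
                    char c = s.charAt(i);
                    // Needs random access to the previous element, which a
                    // plain stream of characters cannot provide.
                    boolean startsWord = i == 0 || !isWordChar(s.charAt(i - 1));
                    return String.valueOf(startsWord ? Character.toUpperCase(c) : c);
                })
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.println(capitalize("#he&llo wo!r^ld")); // #He&Llo Wo!R^Ld
    }
}
```

It works, but it is really an indexed loop in disguise, and it creates one String per character.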

In general, "I want to do X, and use Y" is a mistake. Keep an open mind. It's akin to asking: "I want to butter my toast, and use a hammer to do it". Oookaaaaayyyy, you can probably do that, but, eh, why? There are butter knives right there in the drawer, just... put down the hammer, that's toast. Not a nail.

Same here.

A simple loop can take care of this, no problem:

private static final String BREAK_CHARS = "&-_`";

public String toTitleCase(String input) {
  StringBuilder out = new StringBuilder();
  boolean atBreak = true;
  for (char c : input.toCharArray()) {
    out.append(atBreak ? Character.toUpperCase(c) : c);
    atBreak = Character.isWhitespace(c) || (BREAK_CHARS.indexOf(c) > -1);
  }
  return out.toString();
}

Simple. Efficient. Easy to read. Easy to modify. For example, if you want to go with 'any non-letter counts', trivial: atBreak = !Character.isLetter(c);.
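As an example of such a modification, here is a sketch of the same loop adapted to the rule stated in the question, where letters, digits, "_", "-" and "`" belong to a word and everything else is a break. The class name and the isWordChar helper are assumptions made for this example.

```java
class TitleCaseLoop {
    // Word characters as defined in the question; anything else is a break.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '`';
    }

    static String toTitleCase(String input) {
        StringBuilder out = new StringBuilder(input.length());
        boolean atBreak = true; // the start of the string counts as a break
        for (char c : input.toCharArray()) {
            // toUpperCase leaves digits and symbols unchanged.
            out.append(atBreak ? Character.toUpperCase(c) : c);
            atBreak = !isWordChar(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toTitleCase("#he&llo wo!r^ld"));  // #He&Llo Wo!R^Ld
        System.out.println(toTitleCase("#he&l_lo-wo!r^ld")); // #He&L_lo-wo!R^Ld
    }
}
```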

Contrast this with the stream solution, which is fragile, weird, far less efficient, and requires a regexp that needs half a page's worth of comments for anybody to understand it.

Can you do this with streams? Yes. You can butter toast with a hammer, too. Doesn't make it a good idea though. Put down the hammer!

You can use a simple FSM as you iterate over the characters in the string, with two states: in a word, or not in a word. If you are not in a word and the next character is a letter, convert it to upper case; otherwise (it is not a letter, or you are already in a word), simply copy it unmodified.

boolean isWord(int c) {
    return c == '`' || c == '_' || c == '-' || Character.isLetter(c) || Character.isDigit(c);
}

String capitalize(String s) {
    StringBuilder sb = new StringBuilder();
    boolean inWord = false;
    for (int c : s.codePoints().toArray()) {
        if (!inWord && Character.isLetter(c)) {
            sb.appendCodePoint(Character.toUpperCase(c));
        } else {
            sb.appendCodePoint(c);
        }
        inWord = isWord(c);
    }
    return sb.toString();
}

Note: I have used codePoints(), appendCodePoint(int), and int so that characters outside the Basic Multilingual Plane (code points above U+FFFF) are handled correctly.
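A small sketch of why this matters, using U+10428 (DESERET SMALL LETTER LONG I), whose upper-case form is U+10400. The class and method names are made up for illustration. A char-based loop sees two surrogate halves and leaves them unchanged, while the code-point-based version converts the letter.

```java
class CodePointDemo {
    // A single letter outside the BMP, stored as a surrogate pair.
    static final String SMALL = new String(Character.toChars(0x10428));

    // Upper-cases char by char; surrogates have no case mapping.
    static String upperByChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(Character.toUpperCase(c));
        }
        return sb.toString();
    }

    // Upper-cases code point by code point.
    static String upperByCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.appendCodePoint(Character.toUpperCase(cp)));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(SMALL.length());                   // 2
        System.out.println(upperByChar(SMALL).equals(SMALL)); // true (unchanged)
        System.out.println(Integer.toHexString(
                upperByCodePoint(SMALL).codePointAt(0)));     // 10400
    }
}
```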

I need to capitalize first letter in every word

Here is one way to do it. Admittedly this is a bit longer, but your requirement to change the first letter to upper case (not the first digit or other non-letter) required a helper method. Otherwise it would have been easier. Some of the other answers seem to have missed this point.

Establish the word pattern and the test data.

// The hyphen goes last in the class so it is a literal, not part of a range.
String wordPattern = "[\\w`-]+";
Pattern p = Pattern.compile(wordPattern);
String[] inputData = { "#he&llo wo!r^ld", "0hel`lo-w0rld" };

This simply finds each successive word in the string based on the established regular expression. As each word is found, its first letter is changed to upper case and the word is written back into the StringBuilder at the position where the match was found.

for (String input : inputData) {
    StringBuilder sb = new StringBuilder(input);
    Matcher m = p.matcher(input);
    while (m.find()) {
        sb.replace(m.start(), m.end(),
                upperFirstLetter(m.group()));
    }
    System.out.println(input + " -> " + sb);
}

prints

#he&llo wo!r^ld -> #He&Llo Wo!R^Ld
0hel`lo-w0rld -> 0Hel`lo-w0rld

Words may start with digits, and the requirement was to convert the first letter (not the first character) to upper case. This method finds the first letter, converts it to upper case, and returns the new string. So 01_hello would become 01_Hello.

    
public static String upperFirstLetter(String word) {
    char[] chs = word.toCharArray();
    for (int i = 0; i < chs.length; i++) {
        if (Character.isLetter(chs[i])) {
            chs[i] = Character.toUpperCase(chs[i]);
            break;
        }
    }
    return String.valueOf(chs);
}
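The pieces of this answer can be assembled into one self-contained class, sketched below. The class name MatcherCapitalizer and the capitalize wrapper are assumptions made for illustration. The pattern here lists the hyphen last in the character class so that it is read as a literal hyphen, matching the word definition in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherCapitalizer {
    // Word characters: \w (letters, digits, underscore), backtick, hyphen.
    private static final Pattern WORD = Pattern.compile("[\\w`-]+");

    // Upper-cases the first letter of the word, skipping leading non-letters.
    static String upperFirstLetter(String word) {
        char[] chs = word.toCharArray();
        for (int i = 0; i < chs.length; i++) {
            if (Character.isLetter(chs[i])) {
                chs[i] = Character.toUpperCase(chs[i]);
                break;
            }
        }
        return String.valueOf(chs);
    }

    static String capitalize(String input) {
        StringBuilder sb = new StringBuilder(input);
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            // The replacement has the same length, so offsets stay valid.
            sb.replace(m.start(), m.end(), upperFirstLetter(m.group()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalize("#he&llo wo!r^ld")); // #He&Llo Wo!R^Ld
        System.out.println(capitalize("01_hello"));        // 01_Hello
    }
}
```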
