
Splitting a string and keeping the delimiter characters (regex pattern)

I would like to split a String, but I am despairing over the regex pattern.

I need to split a string like this: Hi I want "to split" this (String) into a String array like this:

String[] array = {"Hi", "I", "want", "\"", "to", "split", "\"", "this", "(", "String", ")"};

This is what I have tried, but it deletes the delimiters.

public static void main(String[] args) {

    String string = "Hi \"why should\" (this work)";

    String[] array;
    array = string.split("\\s"
            + "|\\s(?=\")"
            + "|\\w(?=\")"
            + "|\"(?=\\w)"
            + "|\\s(?=\\()"
            + "|\\w(?=\\))"
            + "|\\((?=\\w)");

    for (String str : array) {
        System.out.println(str);
    }
}

Result:

Hi

why
shoul
"

this
wor
)

You can match the tokens with the regex \\w+|[^\\w\\s] (written here as a Java string literal), assuming that you want each punctuation character to end up in its own token:

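import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "Hi I want \"to split\" this (String).";

// A token is either a run of word characters or one non-word, non-whitespace character.
Matcher matcher = Pattern.compile("\\w+|[^\\w\\s]").matcher(input);
List<String> out = new ArrayList<>();

// Collect every match in order; whitespace is never matched, so it is skipped.
while (matcher.find()) {
    out.add(matcher.group());
}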

The output ArrayList contains:

[Hi, I, want, ", to, split, ", this, (, String, ), .]

You might want to use the (?U) flag to make \\w and \\s follow the Unicode definitions of word and whitespace characters. By default, \\w and \\s only recognize word and whitespace characters in the ASCII range.
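To illustrate the effect of (?U), here is a minimal sketch that runs the same token pattern with and without the flag. The class name UnicodeTokenDemo, the helper tokenize, and the sample input (a German word containing a u-umlaut and a sharp s) are assumptions made for this example and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeTokenDemo {
    // Hypothetical helper: collects every match of the given pattern, in order.
    static List<String> tokenize(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (matcher.find()) {
            out.add(matcher.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Non-ASCII letters are written as Unicode escapes so the
        // encoding of the source file does not matter.
        String input = "Gr\u00fc\u00dfe (Welt)";

        // ASCII-only \w: the two non-ASCII letters fall into [^\w\s]
        // and become single-character tokens.
        System.out.println(tokenize("\\w+|[^\\w\\s]", input));

        // (?U) enables UNICODE_CHARACTER_CLASS, so the whole word is one token.
        System.out.println(tokenize("(?U)\\w+|[^\\w\\s]", input));
    }
}
```

Without the flag the word is broken into four pieces (Gr, the umlaut, the sharp s, e); with (?U) it stays a single token, while the parentheses are separate tokens in both cases.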


For the sake of completeness, here is the solution using split(), which works on Java 8 and above. On Java 7 there will be an extra empty string at the beginning of the result.

String[] tokens = input.split("\\s+|(?<![\\w\\s])(?=\\w)|(?<=\\w)(?![\\w\\s])|(?<=[^\\w\\s])(?=[^\\w\\s])");

The regex is rather complex, because the zero-width splits between punctuation characters and word characters must avoid the positions that are already split by \\s+ .

Since the regex in the split solution is quite a mess, please use the match solution instead.
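For readers who still want to try the split() variant, the sketch below breaks the pattern into its four alternatives, with a comment on each. The class name SplitKeepDelimitersDemo, the constant BOUNDARY, and the helper tokenize are assumptions made for this example; the regex itself is the one quoted above, and the behaviour described assumes Java 8 or later.

```java
import java.util.Arrays;

class SplitKeepDelimitersDemo {
    // Same pattern as in the answer, one alternative per line.
    static final String BOUNDARY =
            "\\s+"                            // runs of whitespace are consumed
            + "|(?<![\\w\\s])(?=\\w)"         // zero-width: punctuation (or start) before a word character
            + "|(?<=\\w)(?![\\w\\s])"         // zero-width: word character before punctuation (or end)
            + "|(?<=[^\\w\\s])(?=[^\\w\\s])"; // zero-width: between two punctuation characters

    // Hypothetical helper wrapping String.split with the pattern above.
    static String[] tokenize(String input) {
        return input.split(BOUNDARY);
    }

    public static void main(String[] args) {
        String input = "Hi I want \"to split\" this (String).";
        System.out.println(Arrays.toString(tokenize(input)));
    }
}
```

Only the first alternative consumes characters; the other three match empty positions, which is why the punctuation survives as separate tokens. On Java 8 and later this should yield the same twelve tokens as the match solution.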

What language are you trying to write this in?

You could write regex groups, something like: (.+)(\\s)

This would match one or more characters followed by a whitespace character.
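As a rough check of what this pattern captures in Java, the sketch below applies it to a short sample string. The class name GreedyGroupDemo, the helper firstMatch, and the sample input are assumptions made for this example. Note that .+ is greedy, so the first group extends to the last whitespace character on the line rather than stopping after one word.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyGroupDemo {
    // Hypothetical helper: returns "group1|group2" for the first match,
    // or null when the pattern does not match at all.
    static String firstMatch(String input) {
        Matcher m = Pattern.compile("(.+)(\\s)").matcher(input);
        return m.find() ? m.group(1) + "|" + m.group(2) : null;
    }

    public static void main(String[] args) {
        // Greedy .+ first grabs the whole line, then backtracks to the
        // last whitespace character, so group 1 spans two words here.
        System.out.println(firstMatch("Hi I want"));
    }
}
```

For the input above, group 1 is "Hi I" and group 2 is the single space before "want", so this pattern alone does not produce one token per word.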
