How to split a string and keep specific delimiters?

I was writing some code that needs to accept user calculator input, so I decided to use regular expressions to tokenize the input string. However, the tokenizing itself fails my unit tests for decimals and "]".

I started by using the lookahead and lookbehind method that I saw here.

I wrote "((?<=[+-/*(){^}[%]π])|(?=[+-/*(){^}[%]π]))", which compiled and ran successfully, except that it failed whenever a number contained a decimal point.

I went back and tried it the same way the accepted answer in the linked question does, using "[+-/*\\^%(){}[]]" (regex3 below), both with and without the π, because my first instinct was that this character caused the issue. In both cases it resulted in:

Exception in thread "main" java.util.regex.PatternSyntaxException: Unclosed character class near index 41 ((?<=[+-/*\\^%(){}[]])|(?=[+-/*\\^%(){}[]]))

At this point I went back to my first attempt and rearranged the terms: "((?<=[+-/*^%(){}[]π])|(?=[+-/*^%(){}[]π]))" (regex2 below), but this one also threw the same PatternSyntaxException at the last parenthesis.

It is probably easier to just show the problem in code. I wrote a class that runs the three regex attempts:

import java.util.Arrays;
public class RegexProblem {
    /** This delimiter pattern came from <a href="https://stackoverflow.com/a/2206432/">this Stack Overflow answer</a>. */
    static public final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";


    // Split on and include + - * / ^ % ( ) [ ] { } π
    public static void main(String[] args) {

        String regex1="((?<=[+-/*(){^}[%]π])|(?=[+-/*(){^}[%]π]))";
        String regex2="((?<=[+-/*^%(){}[]π])|(?=[+-/*^%(){}[]π]))";
        String regex3="[+-/*\\^%(){}[]]";

        String str="1.2+3-4^5*6/(78%9π)+[{0+-1}*2]";
        String str2="[1.2+3]*4";


        String[] expected={"1.2","+","3","-","4","^","5","*","6","/","(","78","%",
                           "9","π",")","+","[","{","0","+","-","1","}","*","2","]"};
        String[] expected2={"[","1.2","+","3","]","*","4"};


        System.out.println("Expected: ");
        System.out.print("str: ");
        System.out.println(Arrays.toString(expected));
        System.out.print("str2: ");
        System.out.println(Arrays.toString(expected2));
        System.out.println();


        System.out.println();
        System.out.println("Regex1: ");
        System.out.print("str: ");
        System.out.println(Arrays.toString(str.split(regex1)));
        System.out.print("str2: ");
        System.out.println(Arrays.toString(str2.split(regex1)));
        System.out.println();
        System.out.println("Regex2: ");
        System.out.print("str: ");
        System.out.println(Arrays.toString(str.split(regex2)));
        System.out.print("str2: ");
        System.out.println(Arrays.toString(str2.split(regex2)));
        System.out.println();
        System.out.println("Regex3: ");
        System.out.print("str: ");
        System.out.println(Arrays.toString(str.split(String.format(WITH_DELIMITER, regex3))));
        System.out.print("str2: ");
        System.out.println(Arrays.toString(str2.split(String.format(WITH_DELIMITER, regex3))));

    }

}

Running regex2 and regex3 both failed, but what baffles me is the behavior of regex1: it runs even though it appears to have the same number of closing characters as the others, and it splits on "." but not on "]".

Try this:

(?<=[^\\d.])|(?=[^\\d.])
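A minimal sketch of how this pattern might be used with String.split, assuming Java 8 or later (where a zero-width match at the start of the input does not produce a leading empty string). The class name SplitKeepDelimiters, the constant BOUNDARY, and the helper tokenize are made up for illustration and are not part of the question's code.

```java
import java.util.Arrays;

public class SplitKeepDelimiters {
    // Split before and after every character that is neither a digit nor a dot.
    // BOUNDARY is a hypothetical name; the pattern is the one suggested above.
    static final String BOUNDARY = "(?<=[^\\d.])|(?=[^\\d.])";

    // Hypothetical helper: returns numbers and single delimiter characters as tokens.
    static String[] tokenize(String input) {
        return input.split(BOUNDARY);
    }

    public static void main(String[] args) {
        // The second test string from the question.
        System.out.println(Arrays.toString(tokenize("[1.2+3]*4")));
        // The first test string from the question; \u03c0 is the Greek letter pi.
        System.out.println(Arrays.toString(tokenize("1.2+3-4^5*6/(78%9\u03c0)+[{0+-1}*2]")));
    }
}
```

With this approach every character that is not part of a number becomes its own token, so the delimiter set never has to be spelled out. The trade-off is that letters or whitespace in the input would also be split off as single-character tokens.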

Explanation:

  • \\d is shorthand for [0-9], so it matches any digit.
  • . within square brackets just matches a literal dot, which appears to always be part of a number in your example input. Therefore, [\\d.] is what we'll use to identify number characters.
  • [^\\d.] matches a non-number character (a leading caret ^ negates a character class).
  • (?<=[^\\d.]) matches a position that is preceded by a non-number character.
  • The alternative (?=[^\\d.]) matches a position that is followed by a non-number character.
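If an explicit delimiter list is preferred, the character-class rules above matter in two more ways in Java: a hyphen between two characters, as in +-/, is read as a range (here from + to /, which also covers , and .), and an unescaped [ inside a class opens a nested class. The sketch below illustrates both points and shows one possible escaped delimiter class. The names ExplicitDelimiterClass, DELIMITERS, tokenize, and compiles are hypothetical, and this is only one way to write the class, not necessarily what the linked answer intended.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ExplicitDelimiterClass {
    // Hypothetical explicit delimiter class: '-' comes first so it is literal,
    // and '[' and ']' are escaped so they cannot open or close a nested class.
    // \u03c0 is the Greek letter pi.
    static final String DELIMITERS = "[-+*/^%(){}\\[\\]\u03c0]";

    // Same look-behind / look-ahead wrapper as in the question.
    static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

    static String[] tokenize(String input) {
        return input.split(String.format(WITH_DELIMITER, DELIMITERS));
    }

    // Reports whether a regex compiles instead of letting the exception propagate.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "+-/" is a range from '+' to '/', so a dot is matched as well.
        System.out.println(Pattern.matches("[+-/]", "."));
        // The inner "[]]" is parsed as a nested class, leaving the outer class unclosed.
        System.out.println(compiles("[+-/*^%(){}[]]"));
        // With the escaped class, decimals stay intact and ']' is a token.
        System.out.println(Arrays.toString(tokenize("[1.2+3]*4")));
    }
}
```

Under these rules, the question's first pattern would compile because its inner [%] forms a complete nested class, which also means ] is never part of the delimiter set, while the +-/ range explains why it splits on the decimal point.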
