
Pattern Error while using RegEx in Java

I am stuck on a problem while using a regular expression. My requirement is to split a long string into pieces of at most 125 characters and insert a line break between them. The split should not fall in the middle of a word. In short, I want to split a string into smaller strings that are either 125 characters long or end at the last word boundary before the 125th character. I hope that is not confusing.

I used a regex to solve this, and believe me, I am an absolute zero at this. I just found some code and copy-pasted it ;-)

StringBuffer result = null;  
while(mailBody.trim().length() > 0){  
    Matcher m = Pattern.compile("^.{0,125}\\b").matcher(mailBody);  
    m.find();  
    String oneLineString = m.group(0);  
    if(result == null)  
        result = new StringBuffer(oneLineString);  
    else  
        result.append("\n"+ oneLineString);  
    mailBody = mailBody.substring(oneLineString.length(),
                                  mailBody.length()).trim();  
}    

This is my code, and it works perfectly unless the input string ends with a full stop (.). In that case it fails with an error like: No match found.

Please help.

Regards, Anoop PK

I cannot comment yet; the answers given are good. I would add that you should initialize your StringBuffer before the loop and, to reduce copying, give it an initial capacity at least as large as your original string, like so:

StringBuffer result = new StringBuffer(mailBody.length());

Then in the loop there would be no need to check for result == null.

Edit: A comment on PSpeed's answer: it needs to add a newline before each chunk it appends in order to match the original, something like this (assuming result is already initialized as I suggest):

while (m.find()) {
    if (result.length() > 0)
        result.append("\n");
    result.append(m.group().trim());
}

Can you try using the following instead?

Matcher m = Pattern.compile("(?:^.{0,125}\\b)|(?:^.{0,125}$)").matcher(mailBody);  

Here we use your original match OR we match a string whose total length is 125 characters or fewer. The (?:X) items are non-capturing groups, so that I can use the | operator on the large groups.

(See the documentation for the Pattern class.)
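To see what the alternation changes, here is a small sketch that applies the pattern once to a string. The class name AlternationDemo and the helper firstChunk are made up for illustration and are not part of the answer. It suggests that a lone "." now matches through the second alternative, while a short sentence still loses its final period to the next chunk.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the answer.
class AlternationDemo {
    private static final Pattern CHUNK =
        Pattern.compile("(?:^.{0,125}\\b)|(?:^.{0,125}$)");

    // Returns the leading chunk, or null if neither alternative matches.
    static String firstChunk(String text) {
        Matcher m = CHUNK.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // First alternative fails on ".", the second one matches it.
        System.out.println(firstChunk("."));
        // First alternative wins here and stops before the period.
        System.out.println(firstChunk("Hello world."));
    }
}
```

So the "No match found" failure goes away, but the trailing period can still end up on a line of its own.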


Addendum: @Anoop: Quite right, having sentence-ending punctuation left on its own line is undesirable behavior. You can try this instead:

if(result == null)  
   result = new StringBuffer("");

mailBody = mailBody.trim();

while(mailBody.length() > 125) {

    // Try not to break immediately before closing punctuation
    Matcher m = Pattern.compile("^.{1,125}\\b(?![-\\.?;&)])").matcher(mailBody);
    String oneLineString;

    // Found a safe place to break string
    if (m.find()) {

        oneLineString = m.group(0);

    // Forced to break string in an ugly fashion
    } else {

        // Try to break at any word boundary at least
        m = Pattern.compile("^.{1,125}\\b").matcher(mailBody);

        if (m.find()) {

            oneLineString = m.group(0);

        // Last ditch scenario, just break at 125 characters
        } else {

            // substring's end index is exclusive, so 125 yields 125 characters
            oneLineString = mailBody.substring(0, 125);

        }

    }

    result.append(oneLineString + "\n");
    mailBody = mailBody.substring(oneLineString.length(),
                                  mailBody.length()).trim();  
}

result.append(mailBody);

Rather than using a regex directly, consider using java.text.BreakIterator; this is what it was designed for.
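A minimal sketch of that idea, assuming a greedy fill of each line up to a maximum length. The class BreakIteratorWrap and the methods wrap and appendLine are hypothetical names, not from the answer. A single segment longer than maxLen (for example, one very long word) is left on its own line rather than being split.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: greedy line wrapping driven by BreakIterator.
class BreakIteratorWrap {

    static String wrap(String text, int maxLen) {
        BreakIterator it = BreakIterator.getLineInstance(Locale.ROOT);
        it.setText(text);
        StringBuilder out = new StringBuilder(text.length());

        int lineStart = it.first(); // start of the line being filled
        int lastBreak = lineStart;  // last break opportunity seen so far
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            // Length of the candidate line, ignoring surrounding whitespace.
            int visibleLen = text.substring(lineStart, end).trim().length();
            if (visibleLen > maxLen && lastBreak > lineStart) {
                // Candidate is too long: emit up to the previous opportunity.
                appendLine(out, text.substring(lineStart, lastBreak));
                lineStart = lastBreak;
            }
            lastBreak = end;
        }
        appendLine(out, text.substring(lineStart)); // whatever is left over
        return out.toString();
    }

    // Appends a trimmed line, separated from the previous one by '\n'.
    private static void appendLine(StringBuilder out, String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        if (out.length() > 0) {
            out.append('\n');
        }
        out.append(trimmed);
    }

    public static void main(String[] args) {
        System.out.println(wrap("Stop here. Next", 10));
    }
}
```

For the question's case this would be called as wrap(mailBody, 125). Because the line iterator does not offer a break between a word and the punctuation that closes it, the full stop should stay with the preceding word.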

First, you can technically get the same results with a simpler pattern and the lookingAt() method, which makes your intent more obvious. Also, it's good to pull the pattern compilation out of the loop.

I think your regex is nice and simple, though you might want to explicitly define what you mean by a word break rather than relying on what a word boundary means. It sounds like you want to capture the period and break after it, but \\b won't do that. You can instead break on whitespace...

Edit: Even simpler now...

StringBuilder result = null;  
Pattern pattern = Pattern.compile( ".{0,125}\\s|.{0,125}" );
Matcher m = pattern.matcher(mailBody);
while( m.find() ) {
    String s = m.group(0).trim();
    if( result == null ) {
        result = new StringBuilder(s);  
    } else {
        result.append(s);
    }
}

...I think the new, improved edits are even simpler and still do what you want.

The pattern can be adjusted if there are other characters that should be considered breakable:

Pattern.compile( ".{0,125}[\\s+&]|.{0,125}" );

...and so on. That would allow breaks on whitespace, + chars, and & chars, as an example.
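For reference, a self-contained sketch of the whitespace-breaking approach. It is a variant, not the code above: the class WhitespaceWrap, the method wrap, and the maxLen parameter are made up for illustration. It also makes three changes that are assumptions on my part: chunks are joined with "\n", the quantifier starts at 1 so that find() never returns an empty match, and the first alternative also accepts end of input so that a short final chunk is not split at its last space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating a whitespace-based wrap.
class WhitespaceWrap {

    static String wrap(String text, int maxLen) {
        // Prefer a chunk that ends in whitespace or at end of input;
        // otherwise fall back to a hard split after maxLen characters.
        Pattern chunk = Pattern.compile(
            ".{1," + maxLen + "}(?:\\s+|$)|.{1," + maxLen + "}");
        Matcher m = chunk.matcher(text);
        StringBuilder result = new StringBuilder(text.length());
        while (m.find()) {
            String line = m.group().trim();
            if (line.isEmpty()) {
                continue; // chunk consisted of whitespace only
            }
            if (result.length() > 0) {
                result.append('\n');
            }
            result.append(line);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap("Stop here. Next", 10));
    }
}
```

With maxLen set to 125 this corresponds to the question's requirement. For repeated calls with the same limit, the compiled Pattern could be cached in a static field instead of being rebuilt on each call.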

The exception isn't being caused by your regex; it's because you're using the API incorrectly. You're supposed to check the return value of the find() method before you call group(). That's how you know the match succeeded.

EDIT: Here's what's happening: when you get to the final chunk of text, the regex initially matches all the way to the end. But \\b can't match at that position because the last character is a period (or full stop), not a word character. So it backtracks one position, and then \\b can match between the final letter and the period.

Then it tries to match another chunk because mailBody.trim().length() is still greater than zero. But this time there are no word characters at all, so the match attempt fails and m.find() returns false. But you don't check the return value; you just go ahead and call m.group(0), which correctly throws an exception. You should be using m.find() as the while condition, not that business with the string length.
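A small sketch that reproduces both steps with the pattern from the question. The class FindCheckDemo and its two helper methods are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the one from the question.
class FindCheckDemo {
    private static final Pattern CHUNK = Pattern.compile("^.{0,125}\\b");

    // Checks find() first; returns null when there is no match.
    static String firstChunk(String text) {
        Matcher m = CHUNK.matcher(text);
        return m.find() ? m.group() : null;
    }

    // Ignores the result of find(), as the question's loop does, and
    // reports whether group(0) then throws IllegalStateException.
    static boolean groupThrowsWithoutCheck(String text) {
        Matcher m = CHUNK.matcher(text);
        m.find();
        try {
            m.group(0);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstChunk("Hello world."));   // stops before the period
        System.out.println(firstChunk("."));              // no match at all
        System.out.println(groupThrowsWithoutCheck(".")); // the reported failure
    }
}
```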

In fact, you're doing a lot more work than you need to; if you use the API correctly you can reduce your code to one line:

mailBody = mailBody.replaceAll(
    "\\G(\\w{125}|.{1,123}(?<=\\w\\b)[.,!?;:/\"-]*)\\s*",
    "$1\n" ).trim();

The regex isn't perfect (I don't think that's possible), but it might do well enough.
