
Split a string in Java into equal-length substrings while maintaining word boundaries

How do I split a string into parts of a maximum character length while maintaining word boundaries?

Say, for example, I want to split the string "hello world" into substrings of at most 7 characters. It should return

"hello "

and

"world"

But my current implementation returns

"hello w"

and

"orld   "

I am using the following code, taken from "Split string to equal length substrings in Java", to split the input string into equal parts:

public static List<String> splitEqually(String text, int size) {
    // Give the list the right capacity to start with. You could use an array
    // instead if you wanted.
    List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);

    for (int start = 0; start < text.length(); start += size) {
        ret.add(text.substring(start, Math.min(text.length(), start + size)));
    }
    return ret;
}

Is it possible to maintain word boundaries while splitting the string into substrings?

To be more specific, I need the splitting algorithm to take into account the word boundaries provided by spaces rather than relying solely on character length. The length still needs to be taken into account, but as a maximum number of characters per piece rather than a hardcoded length.

If I understand your problem correctly, then this code should do what you need (but it assumes that maxLenght is equal to or greater than the length of the longest word):

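// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
String data = "Hello there, my name is not importnant right now."
        + " I am just simple sentecne used to test few things.";
int maxLenght = 10;
// Each match starts where the previous one ended, skips leading whitespace,
// and captures at most maxLenght characters that end at a word boundary.
Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find())
    System.out.println(m.group(1));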

Output:

Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.

Short (or not) explanation of the "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:

(Let's just remember that in Java \ is special not only in regex but also in String literals, so to use a predefined character class like \d we need to write it as "\\d", because the \ has to be escaped in the string literal as well.)

  • \G - an anchor representing the end of the previous match or, if there is no match yet (when we have just started searching), the beginning of the string (same as ^)
  • \s* - zero or more whitespace characters ( \s represents whitespace, * is the "zero-or-more" quantifier)
  • (.{1,"+maxLenght+"}) - let's split it into smaller parts (at runtime maxLenght will hold some numeric value like 10, so the regex engine will see it as .{1,10} )
    • . represents any character (by default it matches any character except line separators like \n or \r , but thanks to the Pattern.DOTALL flag it can now match any character; you can drop this argument if you want each line of the input to be split separately, since its start will be printed on a new line anyway)
    • {1,10} - a quantifier that lets the previously described element appear 1 to 10 times (by default it tries to match as many repetitions as possible)
    • .{1,10} - so, based on the above, it simply represents "1 to 10 of any characters"
    • ( ) - parentheses create groups, structures that let us hold specific parts of the match (here the parentheses come after \\s* because we want to use only the part after the whitespace)
  • (?=\\s|$) - a look-ahead which makes sure that the text matched by .{1,10} is followed by:

    • whitespace ( \\s )

      OR (written as | )

    • the end of the string ( $ ).

So thanks to .{1,10} we can match up to 10 characters. But with (?=\\s|$) after it, we require that the last character matched by .{1,10} is not part of an unfinished word (there must be whitespace or the end of the string after it).
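To connect the pattern back to the question, which asked for a List<String>, here is a minimal sketch that wraps the same regex in a method. The class name RegexWrap and the method name splitOnWords are made up for illustration and do not come from the answer. Like the original snippet, it assumes that no word is longer than maxLength and that words are separated by single spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexWrap {
    // Hypothetical helper: collects the captured pieces instead of printing them.
    // Assumption: every word fits within maxLength; otherwise matching stops early.
    static List<String> splitOnWords(String text, int maxLength) {
        Pattern p = Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> parts = new ArrayList<>();
        while (m.find()) {
            parts.add(m.group(1)); // group 1 excludes the leading whitespace
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitOnWords("hello world", 7)); // prints [hello, world]
    }
}
```

Note that the pieces do not keep the trailing space the question showed in "hello ", because the whitespace between pieces is consumed by \\s* outside the capturing group.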

Non-regex solution, just in case someone is more comfortable (?) not using regular expressions:

private String justify(String s, int limit) {
    StringBuilder justifiedText = new StringBuilder();
    StringBuilder justifiedLine = new StringBuilder();
    String[] words = s.split(" ");
    for (int i = 0; i < words.length; i++) {
        justifiedLine.append(words[i]).append(" ");
        if (i+1 == words.length || justifiedLine.length() + words[i+1].length() > limit) {
            justifiedLine.deleteCharAt(justifiedLine.length() - 1);
            justifiedText.append(justifiedLine.toString()).append(System.lineSeparator());
            justifiedLine = new StringBuilder();
        }
    }
    return justifiedText.toString();
}

Test:

String text = "Long sentence with spaces, and punctuation too. And supercalifragilisticexpialidocious words. No carriage returns, tho -- since it would seem weird to count the words in a new line as part of the previous paragraph's length.";
System.out.println(justify(text, 15));

Output:

Long sentence
with spaces,
and punctuation
too. And
supercalifragilisticexpialidocious
words. No
carriage
returns, tho --
since it would
seem weird to
count the words
in a new line
as part of the
previous
paragraph's
length.

It takes into account words that are longer than the set limit, so it doesn't skip them (unlike the regex version, which just stops processing when it reaches supercalifragilisticexpialidocious).

PS: The comment about all input words being expected to be shorter than the set limit was made after I came up with this solution ;)
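The regex version stops at an overlong word because \G only lets the next match start where the previous one ended, and no piece of at most maxLenght characters ending at a word boundary exists there. One possible workaround, which is not part of either answer, is to add a \\S+ alternative that takes a whole word when the length-limited branch fails. The sketch below uses the hypothetical names LongWordWrap and splitAllowingLongWords.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongWordWrap {
    // Hypothetical variant of the regex approach. The first alternative is the
    // original length-limited piece; the second (\S+) is only tried when the
    // first fails, i.e. when the next word is longer than maxLength.
    static List<String> splitAllowingLongWords(String text, int maxLength) {
        Pattern p = Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "}(?=\\s|$)|\\S+)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> parts = new ArrayList<>();
        while (m.find()) {
            parts.add(m.group(1));
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "too. And supercalifragilisticexpialidocious words. No";
        for (String part : splitAllowingLongWords(text, 15)) {
            System.out.println(part);
        }
    }
}
```

With this input the long word is emitted on its own, exceeding the limit, rather than ending the matching. This is the same trade-off the non-regex version makes.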
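Since the question's splitEqually returns a List<String> rather than one joined string, the same word-by-word loop can also collect its lines into a list. This is a sketch, not the answerer's code. The names ListWrap and wrapToList are invented, and unlike justify it splits on any run of whitespace rather than on single spaces.

```java
import java.util.ArrayList;
import java.util.List;

class ListWrap {
    // Hypothetical list-returning variant of the loop in justify().
    // A word longer than limit is kept whole on its own line.
    static List<String> wrapToList(String s, int limit) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : s.trim().split("\\s+")) {
            // Flush the current line if adding " " + word would exceed the limit.
            if (line.length() > 0 && line.length() + 1 + word.length() > limit) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(word);
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(wrapToList("hello world", 7)); // prints [hello, world]
    }
}
```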

