
Java - Split String into sentences with character limitation

I want to split a text into sentences (split on . or with BreakIterator). But: no sentence may be longer than 100 characters.

Example:

Lorem ipsum dolor sit. Amet consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua. At vero eos et accusam
et justo duo dolores.

To: (3 elements; a sentence may be broken, but never a word)

" Lorem ipsum dolor sit. ",
" Amet consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt
  ut labore et dolore magna",
" aliquyam erat, sed diam voluptua. At vero eos et accusam
  et justo duo dolores. "

How can I do this properly?

There may be a better way to do this, but here is one:

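import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String... args) {

        String originalString = "Lorem ipsum dolor sit. Amet consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore "
                + "et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores.";

        // Split into sentences on '.'; note that split() drops the periods themselves.
        String[] s1 = originalString.split("\\.");
        List<String> list = new ArrayList<String>();

        for (String s : s1) {
            if (s.length() > 100) {
                // \G anchors at the end of the previous match, so this cuts every
                // 100 characters and may cut in the middle of a word.
                list.addAll(Arrays.asList(s.split("(?<=\\G.{100})")));
            } else {
                list.add(s);
            }
        }

        System.out.println(list);
    }
}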

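The "split string in size" regex comes from this SO question. You could probably merge the two regexes, but I'm not sure that would be a wise idea (:

If the regex does not run on Android (the \G operator is not recognized everywhere), try one of the other solutions linked there for splitting a string by size.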
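For environments where \G is not available, cutting a string by size does not need a regex at all. A minimal sketch, assuming a hypothetical helper called splitBySize (the name and class are mine, not from the linked question) that walks the string with substring:

```java
import java.util.ArrayList;
import java.util.List;

class FixedSizeSplit {

    // Hypothetical helper: cuts s into consecutive pieces of at most `size` characters.
    // Like the \G regex, it ignores word boundaries.
    static List<String> splitBySize(String s, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<String> parts = new ArrayList<>();
        for (int start = 0; start < s.length(); start += size) {
            // The last piece may be shorter than size.
            parts.add(s.substring(start, Math.min(s.length(), start + size)));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitBySize("abcdefghij", 4)); // prints [abcd, efgh, ij]
    }
}
```

In the answer above, this could stand in for the s.split("(?<=\\G.{100})") call with size set to 100.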

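Regular expressions won't help you much in this case.

I would split the text on whitespace or . and then start concatenating. Something like this:

Pseudocode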

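words = text.split("[\s\.]");   // note: the delimiters themselves are dropped
lines = new List();
while ( words.length() > 0 ) {

  String line = new String();
  // also stop when no words are left, otherwise words.get(0) fails on the last line;
  // a single word of 100+ characters would still need separate handling
  while ( words.length() > 0 && line.length() + words.get(0).length() < 100 ) {
    line += words.get(0);
    words.remove(0);
  }

  lines.add(line);

}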

Solved (thanks to Macarse for the inspiration):

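// Split before every whitespace character or '.', so each token keeps its leading delimiter.
String[] words = text.split("(?=[\\s\\.])");
ArrayList<String> array = new ArrayList<String>();
int i = 0;
while (words.length > i) {
    String line = "";
    // Append tokens while the line stays under 100 characters.
    while ( words.length > i && line.length() + words[i].length() < 100 ) {
        line += words[i];
        i++;
    }
    array.add(line);
}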
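To see why the lookahead split loses nothing, it may help to look at the tokens it produces. The sketch below (class and method names are only for illustration) splits a short sample: the split points are zero-width, so every space and period stays attached to the start of the token that follows it, and concatenating the tokens gives back the original text.

```java
class LookaheadSplitDemo {

    // Splits before each whitespace character or '.', keeping the delimiter in the token.
    static String[] tokens(String text) {
        return text.split("(?=[\\s.])");
    }

    public static void main(String[] args) {
        for (String t : tokens("Lorem ipsum. Amet")) {
            System.out.println("\"" + t + "\"");
        }
        // prints:
        // "Lorem"
        // " ipsum"
        // "."
        // " Amet"
    }
}
```

Since the leading space counts toward each token's length, a line built this way can begin with a space, so trimming the finished lines is probably worthwhile.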

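Following the previous solution, I quickly ran into an infinite loop when a single word is longer than the limit (very unlikely, but unfortunately my environment is very constrained). So I added a fix (sort of) for this edge case (I think).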

import java.util.*;

public class Main
{
    public static void main(String[] args) {
        sentenceToLines("In which of the following, a person is constantly followed/chased by another person or group of several people?", 15);
    }

    private static ArrayList<String> sentenceToLines(String s, int limit) {
        String[] words = s.split("(?=[\\s\\.])");
        ArrayList<String> wordList =  new ArrayList<String>(Arrays.asList(words));
        ArrayList<String> array = new ArrayList<String>();
        int i = 0, temp;
        String word, line;
        while (i < wordList.size()) {
            line = "";
            temp = i;
            // split the long words to the size of the limit
            while(wordList.get(i).length() > limit) {
                word = wordList.get(i);
                wordList.add(i++, word.substring(0, limit));
                wordList.add(i, word.substring(limit));
                wordList.remove(i+1);
            }
            i = temp;
            // continue making lines with newly split words
            while ( i < wordList.size() && line.length() + wordList.get(i).length() <= limit ) {
                line += wordList.get(i);
                i++;
            }
            System.out.println(line.trim());
            array.add(line.trim());
        }
        return array;
    }
    
}
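The question also mentions BreakIterator, which none of the answers above use. As a rough sketch only (the class SentenceWrap and its methods are hypothetical, and Locale.ENGLISH is an assumption about the text), one could first split into sentences with BreakIterator.getSentenceInstance and then wrap only the sentences that are too long, cutting a word only when the word alone exceeds the limit:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceWrap {

    // Splits text into trimmed sentences using the JDK's sentence BreakIterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }

    // Greedily wraps one sentence at whitespace so that no piece exceeds limit.
    static List<String> wrap(String sentence, int limit) {
        if (limit < 1) throw new IllegalArgumentException("limit must be positive");
        List<String> out = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : sentence.trim().split("\\s+")) {
            // A word longer than the limit cannot fit on any line: cut it hard.
            while (word.length() > limit) {
                if (line.length() > 0) { out.add(line.toString()); line.setLength(0); }
                out.add(word.substring(0, limit));
                word = word.substring(limit);
            }
            if (word.isEmpty()) continue;
            int needed = line.length() == 0 ? word.length() : line.length() + 1 + word.length();
            if (needed > limit) { out.add(line.toString()); line.setLength(0); }
            if (line.length() > 0) line.append(' ');
            line.append(word);
        }
        if (line.length() > 0) out.add(line.toString());
        return out;
    }

    // Keeps short sentences whole and wraps only the ones longer than limit.
    static List<String> split(String text, int limit) {
        List<String> out = new ArrayList<>();
        for (String s : sentences(text)) {
            if (s.length() <= limit) out.add(s); else out.addAll(wrap(s, limit));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Lorem ipsum dolor sit. Amet consetetur sadipscing elitr, sed diam nonumy "
                + "eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. "
                + "At vero eos et accusam et justo duo dolores.";
        for (String piece : split(text, 100)) {
            System.out.println(piece.length() + ": " + piece);
        }
    }
}
```

Unlike the expected output in the question, this sketch never merges the tail of a wrapped sentence with the sentence that follows it; whether that is wanted depends on how strictly sentence boundaries should be kept.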

