
Split String at natural language breaks

Overview

I send Strings to a Text-to-Speech server that accepts a maximum length of 300 characters. Due to network latency, there may be a delay between each section of speech being returned, so I'd like to break the speech up at the most 'natural pauses' wherever possible.

Each server request costs me money, so ideally I'd send the longest string possible, up to the maximum allowed characters.

Here is my current implementation:

import java.util.ArrayList;
import java.util.concurrent.TimeUnit;

private static final boolean DEBUG = true;

private static final int MAX_UTTERANCE_LENGTH = 298;
private static final int MIN_UTTERANCE_LENGTH = 200;

private static final String FULL_STOP_SPACE = ". ";
private static final String QUESTION_MARK_SPACE = "? ";
private static final String EXCLAMATION_MARK_SPACE = "! ";
private static final String LINE_SEPARATOR = System.getProperty("line.separator");
private static final String COMMA_SPACE = ", ";
private static final String JUST_A_SPACE = " ";

public static ArrayList<String> splitUtteranceNaturalBreaks(String utterance) {

    final long then = System.nanoTime();

    final ArrayList<String> speakableUtterances = new ArrayList<String>();

    int splitLocation = 0;
    String success = null;

    while (utterance.length() > MAX_UTTERANCE_LENGTH) {

        splitLocation = utterance.lastIndexOf(FULL_STOP_SPACE, MAX_UTTERANCE_LENGTH);

        if (DEBUG) {
            System.out.println("(0 FULL STOP) - last index at: " + splitLocation);
        }

        if (splitLocation < MIN_UTTERANCE_LENGTH) {
            if (DEBUG) {
                System.out.println("(1 FULL STOP) - NOT_OK");
            }

            splitLocation = utterance.lastIndexOf(QUESTION_MARK_SPACE, MAX_UTTERANCE_LENGTH);

            if (DEBUG) {
                System.out.println("(1 QUESTION MARK) - last index at: " + splitLocation);
            }

            if (splitLocation < MIN_UTTERANCE_LENGTH) {
                if (DEBUG) {
                    System.out.println("(2 QUESTION MARK) - NOT_OK");
                }

                splitLocation = utterance.lastIndexOf(EXCLAMATION_MARK_SPACE, MAX_UTTERANCE_LENGTH);

                if (DEBUG) {
                    System.out.println("(2 EXCLAMATION MARK) - last index at: " + splitLocation);
                }

                if (splitLocation < MIN_UTTERANCE_LENGTH) {
                    if (DEBUG) {
                        System.out.println("(3 EXCLAMATION MARK) - NOT_OK");
                    }

                    splitLocation = utterance.lastIndexOf(LINE_SEPARATOR, MAX_UTTERANCE_LENGTH);

                    if (DEBUG) {
                        System.out.println("(3 SEPARATOR) - last index at: " + splitLocation);
                    }

                    if (splitLocation < MIN_UTTERANCE_LENGTH) {
                        if (DEBUG) {
                            System.out.println("(4 SEPARATOR) - NOT_OK");
                        }

                        splitLocation = utterance.lastIndexOf(COMMA_SPACE, MAX_UTTERANCE_LENGTH);

                        if (DEBUG) {
                            System.out.println("(4 COMMA) - last index at: " + splitLocation);
                        }

                        if (splitLocation < MIN_UTTERANCE_LENGTH) {
                            if (DEBUG) {
                                System.out.println("(5 COMMA) - NOT_OK");
                            }

                            splitLocation = utterance.lastIndexOf(JUST_A_SPACE, MAX_UTTERANCE_LENGTH);

                            if (DEBUG) {
                                System.out.println("(5 SPACE) - last index at: " + splitLocation);
                            }

                            if (splitLocation < MIN_UTTERANCE_LENGTH) {
                                if (DEBUG) {
                                    System.out.println("(6 SPACE) - NOT_OK");
                                }

                                splitLocation = MAX_UTTERANCE_LENGTH;

                                if (DEBUG) {
                                    System.out.println("(6 MAX_UTTERANCE_LENGTH) - last index at: " + splitLocation);
                                }

                            } else {
                                if (DEBUG) {
                                    System.out.println("Accepted");
                                }

                                splitLocation -= 1;
                            }
                        }
                    } else {
                        if (DEBUG) {
                            System.out.println("Accepted");
                        }

                        splitLocation -= 1;
                    }
                } else {
                    if (DEBUG) {
                        System.out.println("Accepted");
                    }
                }
            } else {
                if (DEBUG) {
                    System.out.println("Accepted");
                }
            }
        } else {
            if (DEBUG) {
                System.out.println("Accepted");
            }
        }

        success = utterance.substring(0, (splitLocation + 2));

        speakableUtterances.add(success.trim());

        if (DEBUG) {
            System.out.println("Split - Length: " + success.length() + " -:- " + success);
            System.out.println("------------------------------");
        }

        utterance = utterance.substring((splitLocation + 2)).trim();
    }

    speakableUtterances.add(utterance);

    if (DEBUG) {

        System.out.println("Split - Length: " + utterance.length() + " -:- " + utterance);

        final long now = System.nanoTime();
        final long elapsed = now - then;

        System.out.println("ELAPSED: " + TimeUnit.MILLISECONDS.convert(elapsed, TimeUnit.NANOSECONDS));

    }

    return speakableUtterances;
}

It's ugly because lastIndexOf does not accept a regex. Ugliness aside, it's actually pretty fast.

Problems

Ideally I'd like to use regex that allows for a match on one of my first choice delimiters:

private static final String firstChoice = "[.!?" + LINE_SEPARATOR + "]\\s+";
private static final Pattern pFirstChoice = Pattern.compile(firstChoice);

And then use a matcher to resolve the position:

    Matcher matcher = pFirstChoice.matcher(input);

    if (matcher.find()) {
        splitLocation = matcher.start();
    }

My alternative, within my current implementation, is to store the location of each delimiter and then select the one nearest to MAX_UTTERANCE_LENGTH.
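As a rough illustration of that alternative, the nested checks can be folded into a loop that records the last position of every delimiter in a tier and keeps the one closest to the limit. This is only a sketch under assumptions: the class name, the TIERS grouping, and the use of "\n" in place of the platform line separator are all illustrative and not part of the code above.

```java
import java.util.ArrayList;
import java.util.List;

final class NearestDelimiterSplitter {

    // Limits taken from the question; everything else here is an assumption.
    static final int MAX_UTTERANCE_LENGTH = 298;
    static final int MIN_UTTERANCE_LENGTH = 200;

    // Hypothetical preference tiers: sentence ends, then commas, then spaces.
    static final String[][] TIERS = {
        {". ", "? ", "! ", "\n"},
        {", "},
        {" "}
    };

    /** Returns the exclusive end index of the next chunk of text. */
    static int findSplit(String text) {
        for (String[] tier : TIERS) {
            int best = -1;
            for (String delimiter : tier) {
                int index = text.lastIndexOf(delimiter, MAX_UTTERANCE_LENGTH);
                if (index >= MIN_UTTERANCE_LENGTH) {
                    // Keep the punctuation in the chunk, leave the whitespace behind.
                    best = Math.max(best, index + delimiter.trim().length());
                }
            }
            if (best >= 0) {
                return best; // nearest acceptable delimiter in this tier
            }
        }
        return MAX_UTTERANCE_LENGTH; // nothing usable: hard cut
    }

    static List<String> split(String utterance) {
        List<String> chunks = new ArrayList<>();
        String rest = utterance.trim();
        while (rest.length() > MAX_UTTERANCE_LENGTH) {
            int end = findSplit(rest);
            chunks.add(rest.substring(0, end).trim());
            rest = rest.substring(end).trim();
        }
        if (!rest.isEmpty()) {
            chunks.add(rest);
        }
        return chunks;
    }
}
```

Within a tier the delimiters compete on position only, so a "? " at index 290 wins over a ". " at index 210, which the nested version would not do.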

I've tried various ways of applying MIN_UTTERANCE_LENGTH and MAX_UTTERANCE_LENGTH to the Pattern so that it only captures between these values, and of using a lookbehind (?<=) to iterate in reverse, but this is where my knowledge starts to fail me:

private static final String poorEffort = "([.!?]{200, 298})\\s+";
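One possible way to express the length window in a pattern is to put the bounded quantifier on the text that precedes the delimiter rather than on the delimiter itself, and to anchor the attempt at the start of the chunk with lookingAt(). The sketch below assumes that approach; the class and method names are hypothetical, and it only covers the first-choice delimiters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundedRegexSplit {

    static final int MIN = 200; // MIN_UTTERANCE_LENGTH from the question
    static final int MAX = 298; // MAX_UTTERANCE_LENGTH from the question

    // Between MIN-1 and MAX-1 arbitrary characters, then one terminator that
    // is followed by whitespace. The quantifier is greedy, so the regex engine
    // tries the longest prefix first and backs off towards MIN.
    static final Pattern FIRST_CHOICE = Pattern.compile(
            ".{" + (MIN - 1) + "," + (MAX - 1) + "}[.!?](?=\\s)",
            Pattern.DOTALL);

    /**
     * Returns the exclusive end index of the longest chunk starting at 'from'
     * that ends in a first-choice delimiter, or -1 if there is none in range.
     */
    static int firstChoiceEnd(String text, int from) {
        Matcher matcher = FIRST_CHOICE.matcher(text);
        matcher.region(from, text.length());
        return matcher.lookingAt() ? matcher.end() : -1;
    }
}
```

A caller would fall back to commas, spaces, or a hard cut when this returns -1. Whether it beats the lastIndexOf cascade is something to measure rather than assume, since the backtracking does comparable work.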

Finally

I wonder if any of you regex masters can achieve what I'm after, and confirm whether it would in fact prove more efficient?

I thank you in advance.


I would do something like this:

Pattern p = Pattern.compile(".{1,299}(?:[.!?]\\s+|\\n|$)", Pattern.DOTALL);
Matcher matcher = p.matcher(text);
while (matcher.find()) {
    speakableUtterances.add(matcher.group().trim());
}

Explanation of the regex:

.{1,299}                 any character, 1 to 299 times (greedy, so as many as possible)
(?:[.!?]\\s+|\\n|$)      followed by one of .!? plus whitespace, a newline, or the end of the string

You could consider extending the punctuation class to \\p{Punct}; see the javadoc for Pattern.

You can see a working sample on ideone.
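One caveat with the pattern above: every match has to end at a sentence terminator, a newline, or the end of the input. On a run of more than 299 characters without any of those, find() moves its start position forward, and the characters it skips never reach the output. A possible variation, sketched below, appends lower-priority alternatives so that a match always begins where the previous one ended. The class name and the extra alternatives are assumptions, and unlike the question's code it does not enforce a minimum chunk length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TieredRegexSplitter {

    // Alternatives are tried left to right at each start position, so they
    // act as a preference order: sentence end, comma, whitespace, hard cut.
    static final Pattern CHUNK = Pattern.compile(
            ".{1,299}(?:[.!?]\\s+|\\n|$)"
            + "|.{1,299},\\s+"
            + "|.{1,299}\\s+"
            + "|.{1,299}",
            Pattern.DOTALL);

    static List<String> split(String text) {
        List<String> chunks = new ArrayList<>();
        Matcher matcher = CHUNK.matcher(text);
        while (matcher.find()) {
            String chunk = matcher.group().trim();
            if (!chunk.isEmpty()) { // skip whitespace-only matches
                chunks.add(chunk);
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (String chunk : split("First sentence. Second one? Third!")) {
            System.out.println(chunk.length() + ": " + chunk);
        }
    }
}
```

Because the last alternative matches any 1 to 299 characters, the matcher can never fail at a position that still has input, which is what prevents the silent skipping.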

The Unicode standard defines how you should break text into sentences and other logical components. Here's some working pseudocode:

// tests two consecutive codepoints within the text to detect the end of sentences
boolean continueSentence(Text text, Range range1, Range range2) {
    Code code1 = text.code(range1), code2 = text.code(range2);

    // 0.2  sot ÷   
    if (code1.isStartOfText())
        return false;

    // 0.3      ÷    eot
    if (code2.isEndOfText())
        return false;

    // 3.0  CR  ×    LF
    if (code1.isCR() && code2.isLF())
        return true;

    // 4.0  (Sep | CR | LF) ÷   
    if (code1.isSep() || code1.isCR() || code1.isLF())
        return false;

    // 5.0      ×    [Format Extend]
    if (code2.isFormat() || code2.isExtend())
        return true;

    // 6.0  ATerm   ×    Numeric
    if (code1.isATerm() && (code2.isDigit() || code2.isDecimal() || code2.isNumeric()))
        return true;

    // 7.0  Upper ATerm ×    Upper
    if (code2.isUppercase() && code1.isATerm()) {
        Range range = text.previousCode(range1);
        if (range.isValid() && text.code(range).isUppercase())
            return true;
    }

    boolean allow_STerm = true, return_value = true;

    // 8.0  ATerm Close* Sp*    ×    [^ OLetter Upper Lower Sep CR LF STerm ATerm]* Lower
    Range range = range2;
    Code code = code2;
    while (!code.isOLetter() && !code.isUppercase() && !code.isLowercase() && !code.isSep() && !code.isCR() && !code.isLF() && !code.isSTerm() && !code.isATerm()) {
        if (!(range = text.nextCode(range)).isValid())
            break;
        code = text.code(range);
    }
    range = range1;
    if (code.isLowercase()) {
        code = code1;
        allow_STerm = true;
        goto Sp_Close_ATerm;
    }
    code = code1;

    // 8.1  (STerm | ATerm) Close* Sp*  ×    (SContinue | STerm | ATerm)
    if (code2.isSContinue() || code2.isSTerm() || code2.isATerm())
        goto Sp_Close_ATerm;

    // 9.0  ( STerm | ATerm ) Close*    ×    ( Close | Sp | Sep | CR | LF )
    if (code2.isClose())
        goto Close_ATerm;

    // 10.0 ( STerm | ATerm ) Close* Sp*    ×    ( Sp | Sep | CR | LF )
    if (code2.isSp() || code2.isSep() || code2.isCR() || code2.isLF())
        goto Sp_Close_ATerm;

    // 11.0 ( STerm | ATerm ) Close* Sp* (Sep | CR | LF)?   ÷   
    return_value = false;

    // allow Sep, CR, or LF zero or one times
    for (int iteration = 1; iteration != 0; iteration--) {
        if (!code.isSep() && !code.isCR() && !code.isLF()) goto Sp_Close_ATerm;
        if (!(range = text.previousCode(range)).isValid()) goto Sp_Close_ATerm;
        code = text.code(range);
    }

Sp_Close_ATerm:
    // allow zero or more Sp
    while (code.isSp() && (range = text.previousCode(range)).isValid())
        code = text.code(range);

Close_ATerm:
    // allow zero or more Close
    while (code.isClose() && (range = text.previousCode(range)).isValid())
        code = text.code(range);

    // require STerm or ATerm
    if (code.isATerm() || (allow_STerm && code.isSTerm()))
        return return_value;

    // 12.0     ×    Any
    return true;
}

Then you can iterate over the sentences like so:

// pass in a range of (0, 0) to get the range of the first sentence
// returns a range with a length of 0 if there are no more sentences
Range nextSentence(Text text, Range range) {
try_again:
    range = text.nextCode(new Range(range.start + range.length, 0));
    if (!range.isValid())
        return range;
    Range next = text.nextCode(range);
    long start = range.start;
    while (next.isValid() && continueSentence(text, range, next))
        next = text.nextCode(range = next);
    range = new Range(start, range.start + range.length - start);

    Range range2 = text.trimRange(range);
    if (!range2.isValid())
        goto try_again;

    return range2;
}

Where:

  • Range is a half-open interval: it covers positions >= start and < start + length
  • text.trimRange removes leading and trailing whitespace characters from the range (optional)
  • all of the Code.is[Type] functions are lookups into the Unicode character database. For example, you'll see in some of those files that some codepoints are defined as "CR", "Sep", "StartOfText", etc.
  • Text.code(range) decodes the codepoint in the text at range.start. The length is not used.
  • Text.nextCode and Text.previousCode return the range of the next or previous codepoint within the string, based on the range of the current codepoint. If there is no codepoint in that direction, it returns an invalid range, which is a range with a length of 0.
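For readers working in Java, the Range, code, nextCode and previousCode helpers described in the list above could be sketched over a String roughly as follows. This is an assumption about one workable representation (offsets counted in UTF-16 code units, with a plain String standing in for Text), not the pseudocode author's implementation.

```java
final class CodePointRanges {

    /** Half-open range [start, start + length), measured in UTF-16 code units. */
    static final class Range {
        final int start;
        final int length;

        Range(int start, int length) {
            this.start = start;
            this.length = length;
        }

        /** A range with a length of 0 is the "invalid" marker. */
        boolean isValid() {
            return length > 0;
        }
    }

    /** Decodes the code point at range.start; the length is not used. */
    static int code(String text, Range range) {
        return text.codePointAt(range.start);
    }

    /** Range of the code point after 'range', or an invalid range at the end. */
    static Range nextCode(String text, Range range) {
        int start = range.start + range.length;
        if (start >= text.length()) {
            return new Range(text.length(), 0);
        }
        return new Range(start, Character.charCount(text.codePointAt(start)));
    }

    /** Range of the code point before 'range', or an invalid range at the start. */
    static Range previousCode(String text, Range range) {
        if (range.start <= 0) {
            return new Range(0, 0);
        }
        int length = Character.charCount(text.codePointBefore(range.start));
        return new Range(range.start - length, length);
    }
}
```

Passing a zero-length range at position p to nextCode yields the code point that starts at p, which matches how nextSentence seeds its iteration with a range of (0, 0).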

The standard also defines ways to iterate over words, lines, and characters.
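In Java specifically, java.text.BreakIterator already provides sentence boundary analysis, so a hand-written rule table may not be needed. The sketch below uses it to pack whole sentences into chunks no longer than a given limit. It relies on the JDK's own sentence rules, which may differ in detail from the pseudocode above; the class name, the packing strategy, and the hard cut for over-long sentences are assumptions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentencePacker {

    /** Groups consecutive sentences into chunks of at most maxLength characters. */
    static List<String> pack(String text, int maxLength) {
        List<String> chunks = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);

        StringBuilder current = new StringBuilder();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            String sentence = text.substring(start, end);
            // Flush first if adding this sentence would exceed the limit.
            if (current.length() > 0
                    && (current + sentence).trim().length() > maxLength) {
                addIfNotBlank(chunks, current.toString());
                current.setLength(0);
            }
            current.append(sentence);
            // A single sentence longer than the limit is cut hard.
            while (current.toString().trim().length() > maxLength) {
                addIfNotBlank(chunks, current.substring(0, maxLength));
                current.delete(0, maxLength);
            }
        }
        addIfNotBlank(chunks, current.toString());
        return chunks;
    }

    private static void addIfNotBlank(List<String> chunks, String candidate) {
        String trimmed = candidate.trim();
        if (!trimmed.isEmpty()) {
            chunks.add(trimmed);
        }
    }

    public static void main(String[] args) {
        for (String chunk : pack("Hello world. How are you? Fine!", 20)) {
            System.out.println(chunk);
        }
    }
}
```

For the Text-to-Speech case in the question, maxLength would be the server's 300-character limit, and the hard cut could be swapped for a comma or space fallback.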
