
Divide a sentence into words and punctuation

I need to parse a Sentence object into words and punctuation marks (whitespace counts as a punctuation mark), then add all of them to a general ArrayList<Sentence>.

An example sentence:

A man, a plan, a canal — Panama!
A => word
whitespace => punctuation
man => word
, + space => punctuation
a => word
[...]

I tried reading the whole sentence one character at a time, collecting consecutive characters of the same kind, and creating a new Word or Punctuation from each collected run.

Here's my code:

public class Sentence {

    private String sentence;
    private LinkedList<SentenceElement> elements;

    /**
     * Constructs a sentence.
     * @param aText a string containing all characters of the sentence
     */
    public Sentence(String aText) {
        sentence = aText.trim();
        splitSentence();
    }

    public String getSentence() {
        return sentence;
    }

    public LinkedList<SentenceElement> getElements() {
        return elements;
    }

    /**
     * Splits the sentence into words and punctuation marks.
     */
    private void splitSentence() {
        if (sentence == "" || sentence == null || sentence == "\n") {
            return;
        }

        StringBuilder builder = new StringBuilder();

        int j = 0;
        boolean mark = false;
        while (j < sentence.length()) {
            //char current = sentence.charAt(j);

            while (Character.isLetter(sentence.charAt(j))) {
                if (mark) {
                    elements.add(new Punctuation(builder.toString()));
                    builder.setLength(0);
                    mark = false;
                }
                builder.append(sentence.charAt(j));
                j++;
            } 
            mark = true;

            while (!Character.isLetter(sentence.charAt(j))) {
                if (mark) {
                    elements.add(new Word(builder.toString()));
                    builder.setLength(0);
                    mark = false;
                }
                builder.append(sentence.charAt(j));
                j++;
            }
            mark = true;
        }
    }

But the logic of splitSentence() doesn't work correctly, and I can't find the right solution for it.

I want to implement it like this: read the first character and add it to the builder; while the next character is of the same type (letter or punctuation), keep adding to the builder; when the next character's type differs from the builder's content, create a new word or punctuation element and reset the builder.

Then repeat the same logic.

How do I implement this check correctly?

Split the string on word boundaries (except the first):

String[] parts = sentence.split("(?<!^)\\b");

The array will contain alternating word/punctuation/word/punctuation/word etc.
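To get from that array to the element list the question asks for, something like the following might work. This is only a sketch: the question does not show SentenceElement, Word or Punctuation, so the minimal versions below are hypothetical stand-ins, and RegexSentenceDemo and toElements are names invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class RegexSentenceDemo {

    // Hypothetical stand-ins for the question's classes (not shown in the post).
    public interface SentenceElement {
        String getText();
    }

    public static final class Word implements SentenceElement {
        private final String text;
        public Word(String text) { this.text = text; }
        @Override public String getText() { return text; }
    }

    public static final class Punctuation implements SentenceElement {
        private final String text;
        public Punctuation(String text) { this.text = text; }
        @Override public String getText() { return text; }
    }

    /** Splits on word boundaries and wraps each part in a Word or Punctuation. */
    public static List<SentenceElement> toElements(String sentence) {
        List<SentenceElement> elements = new ArrayList<>();
        if (sentence == null || sentence.isEmpty()) {
            return elements; // nothing to split
        }
        for (String part : sentence.split("(?<!^)\\b")) {
            if (part.matches("\\w+")) {
                elements.add(new Word(part));
            } else {
                elements.add(new Punctuation(part));
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        for (SentenceElement e : toElements("A man, a plan!")) {
            System.out.println(e.getClass().getSimpleName() + ": \"" + e.getText() + "\"");
        }
    }
}
```

Note that \w also matches digits and underscores, so a part such as "42" would be classified as a word here.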


Here's some test code:

String sentence = "A man, a plan, a canal — Panama!";
String[] parts = sentence.split("(?<!^)\\b");
for (String part : parts)
    System.out.println('"' + part + "\" (" + (part.matches("\\w+") ? "word" : "punctuation") + ")");

Output:

"A" (word)
" " (punctuation)
"man" (word)
", " (punctuation)
"a" (word)
" " (punctuation)
"plan" (word)
", " (punctuation)
"a" (word)
" " (punctuation)
"canal" (word)
" — " (punctuation)
"Panama" (word)
"!" (punctuation)
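If you would rather keep the character-by-character approach from the question, here is a minimal sketch of it. The original loop appears to fail for a few reasons: elements is never initialized, the inner while loops call charAt(j) without checking j against the string length, the last run is never flushed, and sentence == "" compares references rather than contents. The SentenceSplitter and Element names below are hypothetical stand-ins, since the question's Word and Punctuation classes are not shown.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceSplitter {

    /** Hypothetical stand-in for the question's Word / Punctuation classes. */
    public static final class Element {
        public final String text;
        public final boolean word; // true = word, false = punctuation (including whitespace)

        Element(String text, boolean word) {
            this.text = text;
            this.word = word;
        }

        @Override
        public String toString() {
            return "\"" + text + "\" (" + (word ? "word" : "punctuation") + ")";
        }
    }

    /** Groups consecutive characters of the same kind (letter vs. non-letter). */
    public static List<Element> split(String sentence) {
        List<Element> elements = new ArrayList<>();
        if (sentence == null || sentence.isEmpty()) {
            return elements;
        }
        StringBuilder builder = new StringBuilder();
        boolean inWord = Character.isLetter(sentence.charAt(0));
        for (int i = 0; i < sentence.length(); i++) {
            char c = sentence.charAt(i);
            boolean letter = Character.isLetter(c);
            if (letter != inWord) {
                // The kind of character changed: flush the run collected so far.
                elements.add(new Element(builder.toString(), inWord));
                builder.setLength(0);
                inWord = letter;
            }
            builder.append(c);
        }
        // Flush the final run, which the loop never emits on its own.
        elements.add(new Element(builder.toString(), inWord));
        return elements;
    }

    public static void main(String[] args) {
        for (Element e : split("A man, a plan, a canal — Panama!")) {
            System.out.println(e);
        }
    }
}
```

A single loop with one "kind changed" check avoids the two nested loops and their separate bounds checks. Unlike the regex version, Character.isLetter treats digits and underscores as punctuation, so the two approaches can differ on input containing them.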
