
Java simple sentence parser

Is there a simple way to create a sentence parser in plain Java, without adding any libraries or JARs?

The parser should not only handle the blanks between words, but be smarter: it should parse ., ! and ?, recognize where a sentence ends, and so on.

After parsing, only the real words should be stored in a database or file, without any special characters.

Thank you all very much in advance :)

You might want to start by looking at the BreakIterator class.

From the JavaDoc.

The BreakIterator class implements methods for finding the location of boundaries in text. Instances of BreakIterator maintain a current position and scan over text returning the index of characters where boundaries occur. Internally, BreakIterator scans text using a CharacterIterator, and is thus able to scan text held by any object implementing that protocol. A StringCharacterIterator is used to scan String objects passed to setText.

You use the factory methods provided by this class to create instances of various types of break iterators. In particular, use getWordInstance, getLineInstance, getSentenceInstance, and getCharacterInstance to create BreakIterators that perform word, line, sentence, and character boundary analysis respectively. A single BreakIterator can work only on one unit (word, line, sentence, and so on). You must use a different iterator for each unit boundary analysis you wish to perform.

Line boundary analysis determines where a text string can be broken when line-wrapping. The mechanism correctly handles punctuation and hyphenated words.

Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses.

Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.

Character boundary analysis allows users to interact with characters as they expect to, for example, when moving the cursor through a text string. Character boundary analysis provides correct navigation through character strings, regardless of how the character is stored. For example, an accented character might be stored as a base character and a diacritical mark. What users consider to be a character can differ between languages.

BreakIterator is intended for use with natural languages only. Do not use this class to tokenize a programming language.

See the demo: BreakIteratorDemo.java
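Since the question also asks to keep only the real words, here is a minimal sketch of how the word iterator described above might be used for that. The class name WordExtractor and the method words are made up for illustration, and the rule "a token counts as a word if it starts with a letter or digit" is an assumption, not something the JavaDoc prescribes.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: collects the word tokens of a text, skipping
// whitespace and punctuation tokens.
public class WordExtractor {

    public static List<String> words(String text, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // Assumption: a token is a "real word" if it begins with a letter or digit.
            // Spaces and punctuation marks come back as separate tokens and are dropped.
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! Is this OK?", Locale.ENGLISH));
    }
}
```

The same loop shape works with getSentenceInstance if whole sentences are wanted instead of words.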

Based on @Jarrod Roberson's answer, I have created a utility method that uses BreakIterator and returns the list of sentences.

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceTokenizer {

    // Splits the text into sentences using the sentence rules of the given locale.
    public static List<String> tokenize(String text, String language, String country) {
        List<String> sentences = new ArrayList<>();
        Locale currentLocale = new Locale(language, country);
        BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(currentLocale);
        sentenceIterator.setText(text);

        // Each sentence runs from one boundary to the next, so it may
        // include trailing whitespace; call trim() on it if needed.
        int start = sentenceIterator.first();
        for (int end = sentenceIterator.next(); end != BreakIterator.DONE;
                start = end, end = sentenceIterator.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }
}

Just use a regular expression to split the String into an array: \\s+ matches one or more whitespace characters (spaces, tabs, etc.).

Then iterate over that array and check whether each word ends with ., ? or ! (using String.endsWith()) to find the ends of sentences.

Before saving any word, apply a regular expression once more to remove every non-alphanumeric character.
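The three steps above could be sketched roughly as follows. The class name RegexSentenceParser and the method parse are invented for this example, and the cleanup pattern (keep Unicode letters and digits, drop everything else) is one possible reading of "non-alphanumeric". Note that this simple approach treats every token ending in ., ! or ? as the end of a sentence, so abbreviations such as "e.g." will split a sentence early.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits text into sentences, each a list of cleaned words.
public class RegexSentenceParser {

    public static List<List<String>> parse(String text) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();

        // Step 1: split on runs of whitespace.
        for (String token : text.trim().split("\\s+")) {
            // Step 2: a token ending in ., ! or ? closes the current sentence.
            boolean endsSentence =
                    token.endsWith(".") || token.endsWith("!") || token.endsWith("?");

            // Step 3: strip everything that is not a letter or digit.
            String word = token.replaceAll("[^\\p{L}\\p{N}]", "");
            if (!word.isEmpty()) {
                current.add(word);
            }
            if (endsSentence && !current.isEmpty()) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Keep a trailing sentence that has no closing punctuation.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(parse("Hello world! How are you? Fine."));
    }
}
```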

Of course, use StringTokenizer

import java.util.StringTokenizer;

public class Token {
    public static void main(String[] args) {

        String sentence = "Java! simple ?sentence parser.";
        String separator = "!?.";

        StringTokenizer st = new StringTokenizer( sentence, separator, true );

        while ( st.hasMoreTokens() ) {
            String token = st.nextToken();
            if ( token.length() == 1 && separator.indexOf( token.charAt( 0 ) ) >= 0 ) {
                System.out.println( "special char:" + token );
            }
            else {
                System.out.println( "word :" + token );
            }

        }
    }
}

StringTokenizer

Scanner

Example:

StringTokenizer tokenizer = new StringTokenizer(input, " !?.");
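The Scanner option mentioned above can be used in a similar way. This is only a sketch: the class name ScannerWords and the method words are made up for illustration, and the delimiter pattern (whitespace plus !, ? and .) is an assumption that mirrors the StringTokenizer delimiters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper: reads words from a string with java.util.Scanner.
public class ScannerWords {

    public static List<String> words(String input) {
        List<String> words = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            // Treat any run of whitespace or sentence punctuation as one delimiter.
            scanner.useDelimiter("[\\s!?.]+");
            while (scanner.hasNext()) {
                words.add(scanner.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Java! simple ?sentence parser."));
    }
}
```

Unlike StringTokenizer, Scanner accepts a regular expression as its delimiter, so further separators can be added to the character class without changing the loop.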
