
Java simple sentence parser

Is there a simple way to create a sentence parser in plain Java, without adding any libraries or JARs?

The parser should not just handle the blanks between words. It should be smarter: parse . ! ? and recognize when a sentence has ended, etc.

After parsing, only the real words should be stored in a database or file, not any special characters.

Thank you all very much in advance :)

You might want to start by looking at the BreakIterator class.

From the JavaDoc:

The BreakIterator class implements methods for finding the location of boundaries in text. Instances of BreakIterator maintain a current position and scan over text, returning the index of characters where boundaries occur. Internally, BreakIterator scans text using a CharacterIterator, and is thus able to scan text held by any object implementing that protocol. A StringCharacterIterator is used to scan String objects passed to setText.

You use the factory methods provided by this class to create instances of various types of break iterators. In particular, use getWordInstance, getLineInstance, getSentenceInstance, and getCharacterInstance to create BreakIterators that perform word, line, sentence, and character boundary analysis respectively. A single BreakIterator can work only on one unit (word, line, sentence, and so on). You must use a different iterator for each unit boundary analysis you wish to perform.

Line boundary analysis determines where a text string can be broken when line-wrapping. The mechanism correctly handles punctuation and hyphenated words.

Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses.

Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.

Character boundary analysis allows users to interact with characters as they expect to, for example, when moving the cursor through a text string. Character boundary analysis provides correct navigation through character strings, regardless of how the character is stored. For example, an accented character might be stored as a base character and a diacritical mark. What users consider to be a character can differ between languages.

BreakIterator is intended for use with natural languages only. Do not use this class to tokenize a programming language.

See demo: BreakIteratorDemo.java
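Since the question asks for only the real words, here is a minimal sketch of word boundary analysis with BreakIterator.getWordInstance. The class name WordExtractor and the helper methods are made up for illustration, and the rule "a token counts as a word if it contains at least one letter or digit" is an assumption about what "real word" means.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordExtractor {

    // Hypothetical helper: walks the word boundaries and keeps only tokens
    // that contain at least one letter or digit (assumed definition of a word).
    public static List<String> words(String text, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);

        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String token = text.substring(start, end);
            if (containsLetterOrDigit(token)) {
                words.add(token);
            }
        }
        return words;
    }

    // Punctuation and whitespace come back as their own tokens, so they fail this check.
    private static boolean containsLetterOrDigit(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (Character.isLetterOrDigit(token.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(words("Java! simple ?sentence parser.", Locale.ENGLISH));
    }
}
```

Because symbols and punctuation get word-breaks on both sides, filtering the tokens is enough to drop the special characters before storing anything.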

Based on @Jarrod Roberson's answer, I have created a utility method that uses BreakIterator and returns the list of sentences.

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public static List<String> tokenize(String text, String language, String country) {
    List<String> sentences = new ArrayList<String>();
    Locale currentLocale = new Locale(language, country);
    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(currentLocale);
    sentenceIterator.setText(text);
    // first() returns the start of the text; each next() returns the end of a sentence.
    int lastBoundary = sentenceIterator.first();
    int boundary = sentenceIterator.next();
    while (boundary != BreakIterator.DONE) {
        // Each sentence keeps its trailing whitespace; call trim() if that is unwanted.
        sentences.add(text.substring(lastBoundary, boundary));
        lastBoundary = boundary;
        boundary = sentenceIterator.next();
    }
    return sentences;
}

Just use a regular expression (\\s+, which matches one or more whitespace characters such as spaces and tabs) to split the String into an array.

Then you can iterate over that array and check whether a word ends with ., ? or ! (using String.endsWith()) to find the ends of sentences.

Before saving any word, use a regular expression once more to remove every non-alphanumeric character.
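The three steps above could be sketched roughly as follows. The class name RegexSentenceSplitter and the method parse are invented for this example, and the character class used to strip non-alphanumerics (keeping Unicode letters and decimal digits) is one possible choice, not the only one.

```java
import java.util.ArrayList;
import java.util.List;

public class RegexSentenceSplitter {

    // Hypothetical helper: returns one list of cleaned words per sentence.
    public static List<List<String>> parse(String text) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();

        // Step 1: split on runs of whitespace.
        for (String token : text.trim().split("\\s+")) {
            // Step 2: a token ending in . ! or ? closes the current sentence.
            boolean endsSentence =
                    token.endsWith(".") || token.endsWith("!") || token.endsWith("?");

            // Step 3: drop everything that is not a letter or a decimal digit.
            String word = token.replaceAll("[^\\p{L}\\p{Nd}]", "");
            if (!word.isEmpty()) {
                current.add(word);
            }

            if (endsSentence && !current.isEmpty()) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }

        // Keep trailing words that were not followed by sentence punctuation.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(parse("Hello, world! Is this a test? Yes."));
    }
}
```

Note that this approach treats every token ending in a period as the end of a sentence, so abbreviations such as "e.g." will split a sentence early.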

Of course, use StringTokenizer:

import java.util.StringTokenizer;

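public class Token {
    public static void main(String[] args) {

        String sentence = "Java! simple ?sentence parser.";
        // Characters that end a sentence; they are also the only delimiters.
        String separator = "!?.";

        // The third argument (true) makes the tokenizer return the delimiters as tokens too.
        StringTokenizer st = new StringTokenizer( sentence, separator, true );

        while ( st.hasMoreTokens() ) {
            String token = st.nextToken();
            // A one-character token that is one of the separators is punctuation.
            if ( token.length() == 1 && separator.indexOf( token.charAt( 0 ) ) >= 0 ) {
                System.out.println( "special char:" + token );
            }
            else {
                // Whitespace is not a delimiter here, so this token may contain spaces.
                System.out.println( "word :" + token );
            }

        }
    }
}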

StringTokenizer

Scanner

Example:

StringTokenizer tokenizer = new StringTokenizer(input, " !?.");
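For the Scanner option, a comparable sketch might look like this. The class name ScannerWords and the delimiter pattern (whitespace plus the same ! ? . characters) are assumptions chosen to mirror the StringTokenizer line above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerWords {

    // Hypothetical helper: collects the tokens between runs of whitespace and ! ? . characters.
    public static List<String> words(String input) {
        List<String> words = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            // Treat any run of whitespace or sentence punctuation as a single delimiter.
            scanner.useDelimiter("[\\s!?.]+");
            while (scanner.hasNext()) {
                words.add(scanner.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Java! simple ?sentence parser."));
    }
}
```

Like the StringTokenizer variant without returned delimiters, this discards the punctuation, so it yields words but no information about where each sentence ends.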
