Splitting a string into a list of substrings of different character types

Question

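I am writing a spell checker that takes a text file as input and outputs a file with the spelling corrected.

The program should preserve formatting and punctuation.

I want to split the input text into a list of string tokens such that each token is one or more word, punctuation, whitespace, or digit characters.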

For example:

Input:

words.txt:

asdf don't ]'.'..;'' as12....asdf.
asdf

Input as a list:

["asdf", " ", "don't", " ", "]'.'..;''", " ", "as", "12", "....", "asdf", ".", "\n", "asdf"]

Words like "won't" and "i'll" should be treated as a single token.

Having the data in this format would let me process the tokens like this:

String output = "";

for(String token : tokens) {
    if(isWord(token)) {
        if(!inDictionary(token)) {
            token = correctSpelling(token);
        }
    }
    output += token;
}

So my main question is: how can I split a text string into a list of substrings as described above? Thanks.

Answer 1

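The main difficulty here is finding a regular expression that matches what you consider to be a "word". In my example, I treat an apostrophe as part of a word if it is preceded by a letter or followed by letters: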

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

//The class name is arbitrary; inDictionary and correctSpelling are the
//helpers from the question and are not shown here
public class Main {
    public static void main(String[] args) {
        String in = "asdf don't ]'.'..;'' as12....asdf.\nasdf";

        //The pattern: a letter followed by letters or apostrophes,
        //or an apostrophe followed by one or more letters
        Pattern p = Pattern.compile("[\\p{Alpha}][\\p{Alpha}']*|'[\\p{Alpha}]+");

        Matcher m = p.matcher(in);
        //If you want to collect the words
        List<String> words = new ArrayList<String>();

        StringBuilder result = new StringBuilder();

        //Now find something from the start
        int pos = 0;
        while(m.find(pos)) {
            //Add everything from starting position to beginning of word
            result.append(in.substring(pos, m.start()));

            //Handle dictionary logic
            String token = m.group();
            words.add(token); //not used actually
            if(!inDictionary(token)) {
                token = correctSpelling(token);
            }
            //Add to result
            result.append(token);
            //Repeat from end position
            pos = m.end();
        }
        //Append remainder of input
        result.append(in.substring(pos));

        System.out.println("Result: " + result.toString());
    }
}
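
The code above only locates the words and copies everything else through unchanged. If the full token list from the question is wanted, one possible sketch is a single regex with one alternative per character type. The class name `TokenListSketch`, the method `tokenize`, and the exact word rule (letters with apostrophes only between letters, which is slightly stricter than the pattern above) are assumptions made for illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split text into word, digit, whitespace and "other" tokens.
public class TokenListSketch {
    // Alternatives are tried in order; together they cover every character,
    // so concatenating the tokens reproduces the input exactly.
    private static final Pattern TOKEN = Pattern.compile(
            "\\p{Alpha}+(?:'\\p{Alpha}+)*"   // word, e.g. don't or i'll
            + "|\\d+"                        // run of digits
            + "|\\s+"                        // run of whitespace
            + "|[^\\p{Alpha}\\d\\s]+");      // run of anything else (punctuation)

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String in = "asdf don't ]'.'..;'' as12....asdf.\nasdf";
        List<String> tokens = tokenize(in);
        for (String token : tokens) {
            // Show each token between brackets so whitespace tokens are visible
            System.out.println("[" + token + "]");
        }
        // Sanity check: no characters were lost while tokenizing
        System.out.println(String.join("", tokens).equals(in));
    }
}
```

For the sample input this should yield the 13 tokens listed in the question. Note that `\p{Alpha}`, `\d` and `\s` are ASCII-only by default in Java, so accented letters would land in the "other" group unless the pattern is adjusted.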

Answer 2

Because I like solving puzzles, I tried the following, and I think it works well:

public class MyTokenizer {
    private final String str;
    private int pos = 0;

    public MyTokenizer(String str) {
        this.str = str;
    }

    public boolean hasNext() {
        return pos < str.length();
    }

    public String next() {
        int type = getType(str.charAt(pos));
        StringBuilder sb = new StringBuilder();
        while(hasNext() && (str.charAt(pos) == '\'' || type == getType(str.charAt(pos)))) {
            sb.append(str.charAt(pos));
            pos++;
        }
        return sb.toString();
    }

    private int getType(char c) {
        String sc = Character.toString(c);
        if (sc.matches("\\d")) {
            return 0;
        }
        else if (sc.matches("\\w")) {
            return 1;
        }
        else if (sc.matches("\\s")) {
            return 2;
        }
        else if (sc.matches("\\p{Punct}")) {
            return 3;
        }
        else {
            return 4;
        }
    }

    public static void main(String... args) {
        MyTokenizer mt = new MyTokenizer("asdf don't ]'.'..;'' as12....asdf.\nasdf");
        while(mt.hasNext()) {
            System.out.println(mt.next());
        }
    }
}
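
One thing to be aware of in the tokenizer above: the apostrophe check applies to every token type, so an apostrophe directly after whitespace or digits is glued onto that token, and each `matches` call recompiles a regex per character. A possible variant, sketched here under assumptions, classifies characters with the `Character` methods and only keeps an apostrophe inside a word when a letter follows it. The names `CharTypeTokenizerSketch`, `Type` and `typeOf` are hypothetical, and unlike `\w` this version treats `_` as "other" and accepts non-ASCII letters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variant of the character-type tokenizer, without per-character regexes.
public class CharTypeTokenizerSketch {
    enum Type { DIGIT, WORD, SPACE, OTHER }

    static Type typeOf(char c) {
        if (Character.isDigit(c)) return Type.DIGIT;
        if (Character.isLetter(c)) return Type.WORD;
        if (Character.isWhitespace(c)) return Type.SPACE;
        return Type.OTHER;
    }

    public static List<String> tokenize(String str) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < str.length()) {
            int start = pos;
            Type type = typeOf(str.charAt(pos));
            pos++;
            while (pos < str.length()) {
                char c = str.charAt(pos);
                // An apostrophe stays inside a word only when a letter follows it
                boolean innerApostrophe = type == Type.WORD && c == '\''
                        && pos + 1 < str.length()
                        && Character.isLetter(str.charAt(pos + 1));
                if (typeOf(c) != type && !innerApostrophe) {
                    break;
                }
                pos++;
            }
            tokens.add(str.substring(start, pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("asdf don't ]'.'..;'' as12....asdf.\nasdf")) {
            // Brackets make whitespace tokens visible
            System.out.println("[" + token + "]");
        }
    }
}
```

Returning a list rather than exposing `hasNext()`/`next()` is only a convenience for the question's `for (String token : tokens)` loop; the iterator style of the original works just as well.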
