Java extracting substring from sentences

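There are word combinations such as "is", "is not" and "does not contain". We have to match these words in a sentence and split the sentence after each of them.

Input: if name is tom and age is not 45 or name does not contain tom then let me know.

Expected output: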

If name is 
tom and age is not 
45 or name does not contain 
tom then let me know

I tried the code below to split and extract, but "is" also occurs inside "is not", and my code cannot tell the two apart:

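// Assumed declarations (not shown in the original post)
static List<String> operators = new ArrayList<>();
static String str = "if name is tom and age is not 45 or name does not contain tom then let me know.";

public static void loadOperators(){
        operators.add("is");
        operators.add("is not");
        operators.add("does not contain");
    }

public static void main(String[] args) {
    loadOperators();
    for(String s : operators){
        // Counts the pieces produced by splitting on each operator separately;
        // "is" also matches inside "is not", so its count is too high
        System.out.println(str.split(s).length - 1);
    }
}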

Since a word can occur multiple times, split will not solve your use case, because "is" and "is not" are different operators for you. Ideally you would:

Iterate :
1. Find the index of the 'operator'.
2. Search for the next space _ or word.
3. Then update your string as substring from its index to length-1.
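
The steps above could be sketched roughly as follows. This is only one possible reading of them: the class name IterativeOperatorSplit and the method split are made up for illustration, operators are assumed to be surrounded by single spaces, and when two operators start at the same position the longer one is assumed to win.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original answer
class IterativeOperatorSplit {

    // Cuts the input after each operator; every piece ends with the operator that closed it
    static List<String> split(String input, List<String> operators) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (true) {
            int bestIndex = -1;
            int bestLength = 0;
            for (String operator : operators) {
                // Pad with spaces so "is" cannot match inside a word such as "this"
                int index = input.indexOf(" " + operator + " ", start);
                if (index < 0) {
                    continue;
                }
                // Earliest match wins; at the same index the longer operator wins
                if (bestIndex < 0 || index < bestIndex
                        || (index == bestIndex && operator.length() > bestLength)) {
                    bestIndex = index;
                    bestLength = operator.length();
                }
            }
            if (bestIndex < 0) {
                break; // no operator left in the remaining text
            }
            int end = bestIndex + 1 + bestLength; // skip the leading space, include the operator
            parts.add(input.substring(start, end).trim());
            start = end; // continue searching in the rest of the string
        }
        parts.add(input.substring(start).trim());
        return parts;
    }

    public static void main(String[] args) {
        String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
        for (String part : split(input, Arrays.asList("is", "is not", "does not contain"))) {
            System.out.println(part);
        }
    }
}
```

For the sample sentence this prints the four expected pieces. An operator at the very start or end of the input would not be found, because of the space padding.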

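I am not entirely sure what you are trying to achieve, but let's give it a try.

For your case, a simple "workaround" might be enough: sort the operators by length, descending. This way the "largest match" is found first. You can define "largest" either as literally the longest string or, preferably, as the number of words (number of spaces contained), so that "is a" takes precedence over "contains".

You still need to make sure that no matches overlap, though. This can be done by comparing the start and end indices of all matches and discarding overlaps according to some criterion, for example "first match wins".

This code does what you seem to want (or what I guess you want):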
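
A minimal sketch of this idea, under a few assumptions: operators are ranked by word count and then by character length, each operator is surrounded by single spaces in the input, and a match is discarded when it overlaps one that was accepted earlier. The names LongestOperatorFirst, findMatches and overlapsAny are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original answer
class LongestOperatorFirst {

    // Number of words in an operator, e.g. "is not" -> 2
    static int wordCount(String s) {
        return s.split(" ").length;
    }

    // True if [start, end) overlaps any already accepted match
    static boolean overlapsAny(List<int[]> accepted, int start, int end) {
        for (int[] m : accepted) {
            if (start < m[1] && m[0] < end) {
                return true;
            }
        }
        return false;
    }

    // Returns [start, end) index pairs of the accepted operator matches, ordered by start
    static List<int[]> findMatches(String input, List<String> operators) {
        List<String> sorted = new ArrayList<>(operators);
        // More words first; ties broken by character length
        sorted.sort((a, b) -> {
            int byWords = Integer.compare(wordCount(b), wordCount(a));
            return byWords != 0 ? byWords : Integer.compare(b.length(), a.length());
        });

        List<int[]> accepted = new ArrayList<>();
        for (String operator : sorted) {
            String needle = " " + operator + " ";
            int from = 0;
            int index;
            while ((index = input.indexOf(needle, from)) >= 0) {
                int start = index + 1;               // first character of the operator
                int end = start + operator.length(); // exclusive end
                // Longer operators were searched first, so they win any overlap
                if (!overlapsAny(accepted, start, end)) {
                    accepted.add(new int[] { start, end });
                }
                from = index + 1;
            }
        }
        accepted.sort((a, b) -> Integer.compare(a[0], b[0]));
        return accepted;
    }

    // Cuts the input after every accepted operator match
    static List<String> split(String input, List<String> operators) {
        List<String> parts = new ArrayList<>();
        int last = 0;
        for (int[] m : findMatches(input, operators)) {
            parts.add(input.substring(last, m[1]).trim());
            last = m[1];
        }
        parts.add(input.substring(last).trim());
        return parts;
    }

    public static void main(String[] args) {
        String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
        for (String part : split(input, Arrays.asList("is", "is not", "does not contain"))) {
            System.out.println(part);
        }
    }
}
```

Here the "is" candidate inside "is not" is found but dropped, because the longer "is not" match covering the same position was accepted first.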

public static void main(String[] args) {
    List<String> operators = new ArrayList<>();
    operators.add("is");
    operators.add("is not");
    operators.add("does not contain");

    String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
    List<String> output = new ArrayList<>();

    int lastFoundOperatorsEndIndex = 0; // First start at the beginning of input

    for (String operator : operators){
        int indexOfOperator = input.indexOf(operator); // Find current operator's position

        if (indexOfOperator > -1) { // If operator was found
            int thisOperatorsEndIndex = indexOfOperator + operator.length(); // Get length of operator and add it to the index to include operator
            output.add(input.substring(lastFoundOperatorsEndIndex, thisOperatorsEndIndex).trim()); // Add text up to and including this operator (trimmed of surrounding whitespace)
            lastFoundOperatorsEndIndex = thisOperatorsEndIndex; // Update startindex for next operator
        }
    }
    output.add(input.substring(lastFoundOperatorsEndIndex, input.length()).trim()); // Add rest of input as last entry to output

    for (String part : output) { // Output to console
        System.out.println(part);
    }
}

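However, it depends heavily on the order of the sentence and of the operators. If we are talking about user input, the task becomes much more complex.

A better approach, using regular expressions (regExp), would be: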

public static void main(String... args) {
    // Define inputs
    String input1 = "if name is tom and age is not 45 or name does not contain tom then let me know.";
    String input2 = "the name is tom and he is 22 years old but the name does not contain jack, but merry is 24 year old.";

    // Output split strings
    for (String part : split(input1)) {
        System.out.println(part.trim());
    }

    System.out.println();

    for (String part : split(input2)) {
        System.out.println(part.trim());
    }
}

private static String[] split(String input) {
    // Define list of operators - 'is not' has to precede 'is'!!
    String[] operators = { "\\sis not\\s", "\\sis\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" };

    // Concatenate operators to regExp-String for search
    StringBuilder searchString = new StringBuilder();

    for (String operator : operators) {
        if (searchString.length() > 0) {
            searchString.append("|");
        }
        searchString.append(operator);
    }

    // Replace all operators by operator+\n and split resulting string at \n-character
    return input.replaceAll("(" + searchString.toString() + ")", "$1\n").split("\n");
}

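Note the order of the operators! "is" has to come after "is not", otherwise every "is not" will be split after its "is".

You can prevent this by using a negative lookahead for the operator "is". So "\\sis\\s" becomes "\\sis(?! not)\\s" (read: "is", not followed by " not").

A minimal version (using JDK 1.8+, since String.join was added in Java 8) could look like this: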
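
To see the effect of the ordering and of the lookahead, here is a small sketch that runs the same replace-then-split trick with three different alternations. The class name AlternationOrderDemo and the short sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original answer
class AlternationOrderDemo {

    // Appends a newline after every operator match, splits there and trims the parts
    static List<String> split(String input, String alternation) {
        List<String> parts = new ArrayList<>();
        for (String part : input.replaceAll("(" + alternation + ")", "$1\n").split("\n")) {
            parts.add(part.trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        String input = "age is not 45";

        // "is" listed first: the first alternative already matches, so "is not" is cut apart
        System.out.println(split(input, "\\sis\\s|\\sis not\\s"));         // [age is, not 45]

        // "is not" listed first: the longer operator gets the chance to match
        System.out.println(split(input, "\\sis not\\s|\\sis\\s"));         // [age is not, 45]

        // Negative lookahead on "is": the order no longer matters
        System.out.println(split(input, "\\sis(?! not)\\s|\\sis not\\s")); // [age is not, 45]
    }
}
```

One side effect of the lookahead as written: it also rejects "is" when the next word merely starts with "not", for example in "is nothing".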

private static String[] split(String input) {
    String[] operators = { "\\sis(?! not)\\s", "\\sis not\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" };
    return input.replaceAll("(" + String.join("|", operators) + ")", "$1\n").split("\n");
}

