Java: extracting substrings from sentences
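There are word combinations such as "is", "is not", and "does not contain". We have to match these words in a sentence and split the sentence at them.
Input: if name is tom and age is not 45 or name does not contain tom then let me know.
Expected output: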
If name is
tom and age is not
45 or name does not contain
tom then let me know
I tried to split and extract with the code below, but "is" also occurs inside "is not", and my code cannot tell the two apart:
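// Assumed declarations; the original post does not show them.
static List<String> operators = new ArrayList<>();
static String str = "if name is tom and age is not 45 or name does not contain tom then let me know.";

public static void loadOperators() {
    operators.add("is");
    operators.add("is not");
    operators.add("does not contain");
}

public static void main(String[] args) {
    loadOperators();
    for (String s : operators) {
        // Counts the pieces produced by each operator; "is" also matches inside "is not"
        System.out.println(str.split(s).length - 1);
    }
}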
Since a word can occur several times, split will not solve your use case, because is and is not are different operators for you. You would ideally:
Iterate :
1. Find the index of the 'operator'.
2. Search for the next space _ or word.
3. Then update your string as substring from its index to length-1.
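The iteration described above could be sketched roughly as follows. This is only a minimal illustration, not the answerer's code: the class name OperatorScanner, the helper indexOfWord, and the tie-breaking rule (when two operators start at the same index, the longer one wins) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OperatorScanner {
    // Operators taken from the question.
    static final List<String> OPERATORS = Arrays.asList("is", "is not", "does not contain");

    // Cuts the input after every operator occurrence, scanning left to right.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        int start = 0; // beginning of the piece currently being collected
        int pos = 0;   // where the next search begins
        while (pos < input.length()) {
            int bestIndex = -1;
            String best = null;
            for (String op : OPERATORS) {
                int idx = indexOfWord(input, op, pos);
                if (idx < 0) {
                    continue;
                }
                // Earliest occurrence wins; on a tie the longer operator wins ("is not" over "is").
                if (bestIndex < 0 || idx < bestIndex
                        || (idx == bestIndex && op.length() > best.length())) {
                    bestIndex = idx;
                    best = op;
                }
            }
            if (best == null) {
                break; // no operator left in the rest of the input
            }
            int end = bestIndex + best.length();
            parts.add(input.substring(start, end).trim());
            start = end;
            pos = end;
        }
        String rest = input.substring(start).trim();
        if (!rest.isEmpty()) {
            parts.add(rest);
        }
        return parts;
    }

    // First index >= from where word occurs surrounded by whitespace or the string boundaries, else -1.
    static int indexOfWord(String input, String word, int from) {
        int idx = input.indexOf(word, from);
        while (idx >= 0) {
            int end = idx + word.length();
            boolean leftOk = idx == 0 || Character.isWhitespace(input.charAt(idx - 1));
            boolean rightOk = end == input.length() || Character.isWhitespace(input.charAt(end));
            if (leftOk && rightOk) {
                return idx;
            }
            idx = input.indexOf(word, idx + 1);
        }
        return -1;
    }

    public static void main(String[] args) {
        String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
        for (String part : split(input)) {
            System.out.println(part);
        }
    }
}
```

For the sentence from the question this yields the four pieces "if name is", "tom and age is not", "45 or name does not contain" and "tom then let me know.". The whitespace check also keeps "is" from matching inside a word such as "this".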
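I am not entirely sure what you are trying to achieve, but let's give it a shot.
For your case, a simple "workaround" might do: sort the operators by length, descending. That way the "biggest match" is found first. You can define "biggest" either as literally the longest string or, preferably, as the number of words (number of spaces contained), so that is a takes precedence over contains.
You do need to make sure that no matches overlap, though. This can be done by comparing the start and end indexes of all matches and discarding overlaps by some criterion, for example "first match wins".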
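A rough sketch of that idea, under stated assumptions: the class LongestOperatorFirst and its Match helper type are hypothetical names, "biggest" is taken to mean most words (ties broken by string length), and an overlap is resolved in favour of the match that was accepted first, which here is the one belonging to the bigger operator. Like the code in the question, it uses plain substring search and does not check word boundaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LongestOperatorFirst {
    // Hypothetical helper type: one operator occurrence covering [start, end).
    static final class Match {
        final int start;
        final int end;

        Match(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    static int wordCount(String s) {
        return s.trim().split("\\s+").length;
    }

    static boolean overlapsAny(Match candidate, List<Match> accepted) {
        for (Match m : accepted) {
            if (candidate.start < m.end && m.start < candidate.end) {
                return true;
            }
        }
        return false;
    }

    // Collects non-overlapping matches, trying the operators with the most words first.
    static List<Match> findMatches(String input, List<String> operators) {
        List<String> sorted = new ArrayList<>(operators);
        sorted.sort((a, b) -> {
            int byWords = Integer.compare(wordCount(b), wordCount(a));
            return byWords != 0 ? byWords : Integer.compare(b.length(), a.length());
        });
        List<Match> accepted = new ArrayList<>();
        for (String op : sorted) {
            int idx = input.indexOf(op);
            while (idx >= 0) {
                Match m = new Match(idx, idx + op.length());
                if (!overlapsAny(m, accepted)) {
                    accepted.add(m); // an already accepted match wins over a later overlapping one
                }
                idx = input.indexOf(op, idx + 1);
            }
        }
        accepted.sort((a, b) -> Integer.compare(a.start, b.start));
        return accepted;
    }

    // Cuts the input after each accepted match.
    static List<String> split(String input, List<String> operators) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (Match m : findMatches(input, operators)) {
            parts.add(input.substring(start, m.end).trim());
            start = m.end;
        }
        String rest = input.substring(start).trim();
        if (!rest.isEmpty()) {
            parts.add(rest);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> operators = Arrays.asList("is", "is not", "does not contain");
        String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
        for (String part : split(input, operators)) {
            System.out.println(part);
        }
    }
}
```

Because "is not" is tried before "is", the occurrence of "is" inside "age is not" overlaps an accepted match and is dropped, while the standalone "is" after "name" is kept.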
This code does what you seem to want to do (or what I guess you want to do):
import java.util.ArrayList;
import java.util.List;

public class OperatorSplit {
    public static void main(String[] args) {
        List<String> operators = new ArrayList<>();
        operators.add("is");
        operators.add("is not");
        operators.add("does not contain");
        String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
        List<String> output = new ArrayList<>();
        int lastFoundOperatorsEndIndex = 0; // First start at the beginning of input
        for (String operator : operators) {
            // Find the current operator's position, searching only after the previous operator
            int indexOfOperator = input.indexOf(operator, lastFoundOperatorsEndIndex);
            if (indexOfOperator > -1) { // If operator was found
                int thisOperatorsEndIndex = indexOfOperator + operator.length(); // Add the operator's length to the index to include the operator
                output.add(input.substring(lastFoundOperatorsEndIndex, thisOperatorsEndIndex).trim()); // Add text up to and including the operator (and trim surrounding spaces)
                lastFoundOperatorsEndIndex = thisOperatorsEndIndex; // Update start index for next operator
            }
        }
        output.add(input.substring(lastFoundOperatorsEndIndex).trim()); // Add rest of input as last entry to output
        for (String part : output) { // Output to console
            System.out.println(part);
        }
    }
}
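But it depends heavily on the order of the sentence and of the operators. If we are talking about user input, the task becomes considerably more complicated.
A better approach, using regular expressions (regExp), would be: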
public static void main(String... args) {
    // Define inputs
    String input1 = "if name is tom and age is not 45 or name does not contain tom then let me know.";
    String input2 = "the name is tom and he is 22 years old but the name does not contain jack, but merry is 24 year old.";
    // Output split strings
    for (String part : split(input1)) {
        System.out.println(part.trim());
    }
    System.out.println();
    for (String part : split(input2)) {
        System.out.println(part.trim());
    }
}

private static String[] split(String input) {
    // Define list of operators - 'is not' has to precede 'is'!!
    String[] operators = { "\\sis not\\s", "\\sis\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" };
    // Concatenate operators to regExp-String for search
    StringBuilder searchString = new StringBuilder();
    for (String operator : operators) {
        if (searchString.length() > 0) {
            searchString.append("|");
        }
        searchString.append(operator);
    }
    // Replace all operators by operator+\n and split resulting string at \n-character
    return input.replaceAll("(" + searchString.toString() + ")", "$1\n").split("\n");
}
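Watch the order of the operators! "is" has to come after "is not", otherwise "is not" will be split.
You can prevent this by using a negative lookahead for the operator "is". So "\\sis\\s" becomes "\\sis(?! not)\\s" (read: "is", not followed by " not").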
A minimalist version (using JDK 1.8+, since String.join was added in Java 8) might look like this:
private static String[] split(String input) {
    String[] operators = { "\\sis(?! not)\\s", "\\sis not\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" };
    return input.replaceAll("(" + String.join("|", operators) + ")", "$1\n").split("\n");
}
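The replace-and-split trick above relies on the input not containing a newline of its own. As a possible variation, and only a sketch rather than part of the original answer, the cut positions can be collected directly with Pattern and Matcher. The class name RegexOperatorSplitter is hypothetical, and the word boundary \b is used here in place of the \s delimiters as an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexOperatorSplitter {
    // Longer alternatives are listed first, so "is not" is tried before "is".
    private static final Pattern OPERATORS =
            Pattern.compile("\\b(?:does not contain|does contain|is not|is)\\b");

    // Cuts the input after every operator match, without inserting marker characters.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = OPERATORS.matcher(input);
        int start = 0;
        while (matcher.find()) {
            parts.add(input.substring(start, matcher.end()).trim());
            start = matcher.end();
        }
        String rest = input.substring(start).trim();
        if (!rest.isEmpty()) {
            parts.add(rest);
        }
        return parts;
    }

    public static void main(String[] args) {
        String input = "the name is tom and he is 22 years old but the name does not contain jack, but merry is 24 year old.";
        for (String part : split(input)) {
            System.out.println(part);
        }
    }
}
```

Since the regex engine tries the alternatives from left to right at each position, listing "is not" before "is" has the same effect as the negative lookahead.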