Java regex to split a sentence into words, keeping a value and its metric as a single word
I am trying to split a sentence into a set of words. What I am looking for is to also take the metric into account when chunking numbers.
E.g. (made-up):
document= The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs.
What is required is a set of words:
the
root
cause
problem
...
40 degrees
30 percent
1.67 gpm
1-19666 tablet
3 hrs
What I tried is:
List<String> bagOfWords = new ArrayList<>();
String[] words = StringUtils.normalizeSpace(document.replaceAll("[^0-9a-zA-Z_.-]", " ")).split(" ");
for (String word : words) {
    bagOfWords.add(StringUtils.normalizeSpace(word.replaceAll("\\.(?!\\d)", " ")));
}
System.out.println("NEW 2 :: " + bagOfWords.toString());
Assuming that a word containing a digit is followed by another word that does not contain a digit, you can use the following code:
private static final String DOC = "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs";
// ...
Pattern pattern = Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");
Matcher matcher = pattern.matcher(DOC);
List<String> words = new ArrayList<>();
while (matcher.find()) {
    words.add(matcher.group());
}
for (String word : words) {
    System.out.println(word);
}
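Explanation:
\\b looks for a word boundary.
\\S is a non-whitespace character, so a word can contain anything, such as a dot or a comma.
(...)? is the first, optional part. It captures the word with a number, if there is one: it has some characters (\\S*), then a digit (\\d), then some more characters (\\S*).
In the second part, S is followed by +, not *, so the word itself must contain at least one character.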
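Because the first part is an optional capturing group, `matcher.group(1)` is `null` whenever the match is a plain word. That can be used to pick out only the value-plus-metric tokens. This is a minimal sketch reusing the pattern above; the class name `MetricTokenizer`, the method `valueUnitTokens`, and the shortened sample sentence are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetricTokenizer {
    // Same pattern as above: an optional "word containing a digit" plus
    // whitespace (group 1), followed by a mandatory word.
    private static final Pattern TOKEN =
            Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");

    /** Returns only the matches in which group 1 took part, i.e. "value metric" pairs. */
    static List<String> valueUnitTokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // group(1) is null when the optional part did not match.
            if (m.group(1) != null) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String doc = "it is currently 40 degrees and take 1.67 gpm every 3 hrs";
        // prints [40 degrees, 1.67 gpm, 3 hrs]
        System.out.println(valueUnitTokens(doc));
    }
}
```

Note that the pattern merges a numeric token with whatever word follows it, so a phrase like "in 2020 we" would yield "2020 we". If that is too permissive for your data, restricting the second word to known units is one option.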
What you are asking is a bit broad, but here is a trick that works for most sentences of this format.
First, create a list containing the keywords for your units, e.g. hrs, tablet, gpm ...
Once you have these, it is easy to pick out what you need.
// Requires java.util.ArrayList and java.util.List.
String document = "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs.";
// Drop the trailing full stop so it does not stick to the last word.
if (document.endsWith(".")) {
    document = document.substring(0, document.length() - 1);
}
System.out.println(document);
// Split on whitespace and commas, so "temperature," becomes "temperature".
String[] splitted = document.split("[\\s,]+");
List<String> keywords = new ArrayList<>();
keywords.add("degrees");
keywords.add("percent");
keywords.add("gpm");
keywords.add("tablet");
keywords.add("hrs");
List<String> words = new ArrayList<>();
for (String s : splitted) {
    if (keywords.contains(s) && !words.isEmpty()) {
        // s is a unit keyword: append it to the last item in the list
        int lastIndex = words.size() - 1;
        words.set(lastIndex, words.get(lastIndex) + " " + s);
    } else {
        words.add(s);
    }
}
for (String s : words) {
    System.out.println(s);
}
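The loop above appends a unit keyword to whatever token came before it, even when that token is not a number (for example "one tablet"). A possible refinement, sketched here under the assumption that a unit should only be merged when the preceding token contains a digit, is shown below. The names `UnitMerger`, `UNITS`, and `merge` are hypothetical, and the unit set is the same made-up list as above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UnitMerger {
    // Assumed unit keywords; extend this set for your own data.
    private static final Set<String> UNITS =
            new HashSet<>(Arrays.asList("degrees", "percent", "gpm", "tablet", "hrs"));

    static List<String> merge(String document) {
        List<String> words = new ArrayList<>();
        // Split on runs of whitespace and commas.
        for (String token : document.trim().split("[\\s,]+")) {
            // Drop a trailing full stop but keep decimals such as 1.67 intact.
            if (token.endsWith(".")) {
                token = token.substring(0, token.length() - 1);
            }
            if (token.isEmpty()) {
                continue;
            }
            int last = words.size() - 1;
            boolean previousHasDigit = last >= 0 && words.get(last).matches(".*\\d.*");
            if (UNITS.contains(token) && previousHasDigit) {
                // Unit directly after a numeric token: glue them together.
                words.set(last, words.get(last) + " " + token);
            } else {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [currently, 40 degrees, take, one, tablet, every, 3 hrs]
        System.out.println(merge("currently 40 degrees, take one tablet every 3 hrs."));
    }
}
```

Using a `Set` for the units keeps the lookup cheap as the list grows. The digit check is only a heuristic, so whether it suits your documents is something to verify on real data.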