
Regex for start and end of sentence

Is there a way to match the start and end of a sentence in Java? The easiest case is a sentence ending with a simple dot (.). In some other cases it could end with a colon (:), or with an abbreviation followed by a colon (.:).

For example, some random news text:

Cliffs have collapsed in New Zealand during an earthquake in the city of Christchurch on the South Island. No serious damage or fatalities were reported in the Valentine's Day quake that struck at 13:13 local time. Based on the med. report everybody were ok.

My goal is to get a word (possibly an abbreviation) plus its context, but if possible only the sentence to which the word belongs.

So a successful output for me would be something like this:

selected word -> collapsed

context -> Cliffs have collapsed in New Zealand during an earthquake in the city of Christchurch on the South Island.

selected word -> med.

context -> Based on the med. report everybody were ok.

Thanks

What you are looking for is a natural language processing (NLP) toolkit. For Java you can use CoreNLP; its tutorials page already has some example cases. You can certainly write a regular expression that looks for all characters between a set of delimiter characters (. : ? etc.), and it would look something like this:

.*?(?=[.:])

Then you would have to loop through the matched results and find the relevant sentences that contain your words. But I recommend you use an NLP library to achieve this.
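The regex-and-loop approach can be sketched roughly as follows. This is only an illustration under a few assumptions: the delimiter set is limited to `.`, `:` and `?`, a negated character class is used instead of the lazy match so that no empty matches occur, and the names `NaiveSplit` and `chunksContaining` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveSplit {

   // A run of non-delimiter characters followed by exactly one delimiter.
   private static final Pattern CHUNK = Pattern.compile( "[^.:?]+[.:?]" );

   // Returns every chunk of the text that contains the given word.
   static List<String> chunksContaining( String text, String word ) {
      final List<String> hits = new ArrayList<>();
      final Matcher m = CHUNK.matcher( text );
      while( m.find()) {
         final String chunk = m.group().trim();
         if( chunk.contains( word )) {
            hits.add( chunk );
         }
      }
      return hits;
   }

   public static void main( String[] args ) {
      final String text = "No damage was reported at 13:13 local time. "
         + "Based on the med. report everybody were ok.";
      System.out.println( chunksContaining( text, "time" )); // [13 local time.]
      System.out.println( chunksContaining( text, "med" ));  // [Based on the med.]
   }
}
```

Both results are cut short: the colon in 13:13 and the period after med are treated as sentence ends. That is the kind of case where a plain delimiter regex falls down and an NLP sentence splitter is the safer choice.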

The code:

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

   public static void main( String[] args ) {
      final Map<String, String> dict = new HashMap<>();
      dict.put( "med", "medical" );
      final String text =
         "Cliffs have collapsed in New Zealand during an earthquake in the "
         + "city of Christchurch on the South Island. No serious damage or "
         + "fatalities were reported in the Valentine's Day quake that struck "
         + "at 13:13 local time. Based on the med. report everybody were ok.";
      final Pattern p = Pattern.compile( "[^\\.]+\\W+(\\w+)\\." );
      final Matcher m = p.matcher( text );
      int pos = 0;
      while(( pos < text.length()) && m.find( pos )) {
         pos = m.end() + 1;
         final String word = m.group( 1 );
         if( dict.containsKey( word )) {
            final String repl            = dict.get( word );
            final String beginOfSentence = text.substring( m.start(), m.end());
            final String endOfSentence;
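            // Matcher.find(int) throws for an index past the end of the text,
            // and m.start() throws after a failed find, so guard both cases.
            if( pos <= text.length() && m.find( pos )) {
               endOfSentence = text.substring( m.start() - 1, m.end());
            }
            else {
               endOfSentence = text.substring( Math.min( pos - 1, text.length()));
            }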
            System.err.printf( "Replace '%s.' in '%s%s' with '%s'\n",
               word, beginOfSentence, endOfSentence, repl );
            // String.replace matches "med." literally; replaceAll would treat '.' as a regex wildcard.
            final String sentence =
               ( beginOfSentence + endOfSentence ).replace( word + '.', repl );
            System.err.println( sentence );
         }
      }
   }
}

The execution:

Replace 'med.' in 'Based on the med. report everybody were ok.' with 'medical'
Based on the medical report everybody were ok.

You can spot a sentence easily. It starts with a capital letter and ends with one of the characters .:!? followed either by a space and another capital letter or by the end of the whole string.

Compare the difference between "time. Based" and "med. report": in the first, the period is followed by a space and a capital letter, so it ends a sentence; in the second, a lowercase letter follows, so the sentence continues.

So the regex capturing the whole sentence should look like this:

([A-Z][a-z].*?[.:!?](?=$| [A-Z]))

Take a look at it on Regex101.
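A minimal sketch of how this pattern could be used from Java to get the context of a selected word. The class name `SentenceFinder` and the method `contextOf` are made up for illustration, and the word lookup is a plain substring check, which is an assumption about what "selected word" means here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceFinder {

   // The sentence pattern given above, unchanged.
   private static final Pattern SENTENCE =
      Pattern.compile( "([A-Z][a-z].*?[.:!?](?=$| [A-Z]))" );

   // Returns the first matched sentence that contains the word, if any.
   static Optional<String> contextOf( String text, String word ) {
      final Matcher m = SENTENCE.matcher( text );
      while( m.find()) {
         final String sentence = m.group( 1 );
         if( sentence.contains( word )) {
            return Optional.of( sentence );
         }
      }
      return Optional.empty();
   }

   public static void main( String[] args ) {
      final String text =
         "Cliffs have collapsed in New Zealand during an earthquake in the "
         + "city of Christchurch on the South Island. No serious damage or "
         + "fatalities were reported in the Valentine's Day quake that struck "
         + "at 13:13 local time. Based on the med. report everybody were ok.";
      System.out.println( "context -> " + contextOf( text, "collapsed" ).orElse( "" ));
      System.out.println( "context -> " + contextOf( text, "med." ).orElse( "" ));
   }
}
```

For the sample text this prints the first sentence for "collapsed" and "Based on the med. report everybody were ok." for "med.". An abbreviation followed by a capitalized word (for example a title before a name) would still be split in the wrong place, so the pattern is a heuristic rather than a general sentence detector.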
