
Search substring in a string using regex

I'm trying to search a string for a set of words held in an ArrayList (terms_1pers). Since the precondition is that there must be no letters immediately before or after the search word, I thought of using a regular expression.

I just don't know what I'm doing wrong with the matches method. In the code below, if the match is not verified, it writes to an external file.

String url = csvRecord.get("url");
String text = csvRecord.get("review");
String var = null;
for(String term : terms_1pers)
{
   if(!text.matches("[^a-z]"+term+"[^a-z]"))
   {
      var="true";
   }
}
if(!var.equals("true"))
{
    bw.write(url+";"+text+"\n");
}

To find regex matches, you should use the regex classes Pattern and Matcher.

String term = "term";
ArrayList<String> a  = new ArrayList<String>();
a.add("123term456"); //true
a.add("A123Term5"); //false
a.add("term456"); //true
a.add("123term"); //true
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for(String text : a) {
    Matcher m = p.matcher(text);
    if (m.find()) {
         System.out.println("Found: " + m.group(1) );
         //the term is wrapped in the first capturing group, so read it with group(1); group(0) is the whole match
    }
    else System.out.println("No match for: " + term);
}


In the example above, we create an instance of a Pattern (https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) to find matches in the text you are matching against.

Note that I adjusted the regex a bit. The character classes in this code exclude all letters A-Z and their lowercase versions from the parts before and after the term. The pattern also allows for situations where there are no characters at all before or after the match term. If you need to have something there, use + instead of *. I also used ^ and $ to anchor the pattern to the start and end of the text, which forces the whole text to consist of only these three parts. If this doesn't fit your use case, you may need to adjust.
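As a rough illustration of the * versus + choice and of the anchors, here is a minimal sketch. The class QuantifierDemo and the helper surrounded are made-up names for this example, and the quantifier is passed in as a string only to keep the comparison short.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: builds ^[^A-Za-z]Q(term)[^A-Za-z]Q$ where Q is "*" or "+"
    // and reports whether the whole text fits that shape.
    static boolean surrounded(String text, String term, String quantifier) {
        Pattern p = Pattern.compile(
            "^[^A-Za-z]" + quantifier + "(" + term + ")[^A-Za-z]" + quantifier + "$");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(surrounded("term456", "term", "*"));    // true: nothing is required before the term
        System.out.println(surrounded("term456", "term", "+"));    // false: + needs at least one non-letter first
        System.out.println(surrounded("123term456", "term", "+")); // true
        System.out.println(surrounded("a term b", "term", "*"));   // false: ^ and $ reject the surrounding letters
    }
}
```

The last call shows the effect of the anchors: a term that sits inside a longer sentence is rejected, which may or may not be what the use case needs.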

To demonstrate using this with a variety of different terms:

ArrayList<String> terms = new ArrayList<String>();
terms.add("term");
terms.add("the book is on the table");
terms.add("1981 was the best year ever!");
ArrayList<String> a  = new ArrayList<String>();
a.add("123term456");
a.add("A123Term5");
a.add("the book is on the table456");
a.add("1@#!231981 was the best year ever!9#");
for (String term: terms) {

    Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");

    for(String text : a) {

        Matcher m = p.matcher(text);
        if (m.find()) {
             System.out.println("Found: " + m.group(1)  + " in " + text);
             //the term is wrapped in the first capturing group, so read it with group(1); group(0) is the whole match
        }
        else System.out.println("No match for: " + term + " in " + text);
    }
}

Output for this is:

Found: term in 123term456
No match for: term in A123Term5
No match for: term in the book is on the table456
....

In response to the question about making String term case insensitive, here's a way to build the pattern string by using java.lang.Character to add options for the upper and lower case form of each letter.

String term = "This iS the teRm.";
String matchText = "123This is the term.";
StringBuilder str = new StringBuilder();
str.append("^[^A-Za-z]*(");
for (int i = 0; i < term.length(); i++) {
  char c = term.charAt(i);
  if (Character.isLetter(c))
    str.append("(" + Character.toLowerCase(c) + "|" + Character.toUpperCase(c) + ")");
  else str.append(c);
}
str.append(")[^A-Za-z]*$");

System.out.println(str.toString());


Pattern p = Pattern.compile(str.toString());
Matcher m = p.matcher(matchText);
if (m.find()) System.out.println("Found!");
else System.out.println("Not Found!");

This code outputs two lines. The first line is the regex string that's being compiled in the Pattern: "^[^A-Za-z]*((t|T)(h|H)(i|I)(s|S) (i|I)(s|S) (t|T)(h|H)(e|E) (t|T)(e|E)(r|R)(m|M).)[^A-Za-z]*$" This adjusted regex allows letters in the term to be matched regardless of case. The second output line is "Found!" because the mixed case term is found within matchText.
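A different technique from the hand-built alternations above is to let the regex engine handle case through the Pattern.CASE_INSENSITIVE flag. The sketch below assumes that is acceptable for the use case; the class and method names are invented for illustration. It also wraps the term in Pattern.quote, so the trailing "." in the term is treated as a literal dot instead of "any character".

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Hypothetical helper: same ^non-letters(term)non-letters$ shape as above,
    // but case is handled by a flag and the term is quoted as a literal.
    static boolean matchesIgnoringCase(String text, String term) {
        Pattern p = Pattern.compile(
            "^[^A-Za-z]*(" + Pattern.quote(term) + ")[^A-Za-z]*$",
            Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String term = "This iS the teRm.";
        System.out.println(matchesIgnoringCase("123This is the term.", term)); // true
        System.out.println(matchesIgnoringCase("123This is the termX", term)); // false: the quoted "." must be a real dot
    }
}
```

CASE_INSENSITIVE on its own only covers ASCII letters; combining it with Pattern.UNICODE_CASE extends it to other alphabets.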

You did not consider the case where the start and end may contain letters, so adding .* at the front and end should solve your problem.

for(String term : terms_1pers)
{
   if( text.matches(".*[^a-zA-Z]+" + term + "[^a-zA-Z]+.*") )
   {
      var="true";
      break; //exit the loop
   }
}
if(!"true".equals(var)) // null-safe: var is still null when no term matched
{
    bw.write(url+";"+text+"\n");
}
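One thing worth checking with this pattern is a term that sits at the very start or end of the text, because [^a-zA-Z]+ demands at least one non-letter on each side. The sketch below compares it with a variant that makes each side optional. Both helper names are invented for this example, and the second pattern is an illustrative assumption rather than part of the answer.

```java
class EdgeCaseDemo {
    // The pattern from the snippet above: a non-letter is required on both sides of the term.
    static boolean paddedMatch(String text, String term) {
        return text.matches(".*[^a-zA-Z]+" + term + "[^a-zA-Z]+.*");
    }

    // Illustrative variant: each side is either empty (string boundary)
    // or separated from the term by a non-letter.
    static boolean boundaryMatch(String text, String term) {
        return text.matches("(?:.*[^a-zA-Z])?" + term + "(?:[^a-zA-Z].*)?");
    }

    public static void main(String[] args) {
        System.out.println(paddedMatch("I love it!", "love"));   // true
        System.out.println(paddedMatch("love it", "love"));      // false: nothing precedes the term
        System.out.println(boundaryMatch("love it", "love"));    // true
        System.out.println(boundaryMatch("I love", "love"));     // true
        System.out.println(boundaryMatch("glove it", "love"));   // false: a letter precedes the term
    }
}
```

Note that "." does not match line terminators by default, so multi-line review text would need the DOTALL flag, for example a (?s) prefix, with either pattern.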

There are several things to note:

  • matches requires a full string match, so [^a-z]term[^a-z] will only match a string like :term. (one non-letter, the term, one non-letter). You need to use .find() to find partial matches
  • If you pass a literal string to a regex, you need to Pattern.quote it; otherwise, if it contains special chars, it will not get matched as intended
  • To check if a word has some pattern before or after it, or sits at the start/end of the string, you should either use alternations with anchors (like (?:^|[^a-z]) or (?:$|[^a-z]) ) or lookarounds, (?<![a-z]) and (?![a-z]) .
  • To match any letter just use \\p{Alpha} or - if you plan to match any Unicode letter - \\p{L} .
  • It is more logical to declare the var variable as a Boolean.
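The first three points can be seen in a short sketch. WholeWordDemo and containsWord are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class WholeWordDemo {
    // Hypothetical helper: true when term occurs in text with no letter
    // directly before or after it. The term is quoted, so it is matched literally.
    static boolean containsWord(String text, String term) {
        Pattern p = Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // matches() only succeeds when the entire string fits the pattern
        System.out.println(":term.".matches("[^a-z]term[^a-z]"));     // true
        System.out.println("a :term. b".matches("[^a-z]term[^a-z]")); // false

        // find() with lookarounds locates the term anywhere, including at the string edges
        System.out.println(containsWord("a :term. b", "term"));   // true
        System.out.println(containsWord("term", "term"));         // true
        System.out.println(containsWord("terms", "term"));        // false

        // Pattern.quote keeps regex metacharacters in the term literal
        System.out.println(containsWord("what is c++?", "c++"));  // true
    }
}
```

Unlike [^a-z] on each side, the lookarounds consume no characters, so a term at the very start or end of the text still matches.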

Fixed code:

String url = csvRecord.get("url");
String text = csvRecord.get("review");
Boolean var = false;
for(String term : terms_1pers)
{
   Matcher m = Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
   // If the search must be case insensitive use
   // Matcher m = Pattern.compile("(?i)(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text); 
   if(!m.find())
   {
       var = true;
   }
}
if (!var) {
   bw.write(url+";"+text+"\n");
}
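If the same term list is applied to many CSV records, the patterns can be compiled once up front instead of once per record. The following is a minimal sketch of that idea, not the code from the answer: TermFilter, compileTerms and containsAny are invented names, and the case-insensitive flags are an assumption about what the search needs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class TermFilter {
    // Compile one whole-word pattern per term, once, for reuse across records.
    static List<Pattern> compileTerms(List<String> terms) {
        List<Pattern> patterns = new ArrayList<>();
        for (String term : terms) {
            patterns.add(Pattern.compile(
                "(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));
        }
        return patterns;
    }

    // True when at least one term occurs in the text as a whole word.
    static boolean containsAny(String text, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(text).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = compileTerms(Arrays.asList("I", "my"));
        System.out.println(containsAny("In My opinion it works", patterns)); // true: "My" stands alone
        System.out.println(containsAny("Army of items", patterns));          // false: both terms only occur inside words
    }
}
```

Whether a record is written when containsAny returns true or false depends on the filtering rule intended in the question, so that decision is left to the caller.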

