Using Regular Expressions

I am having trouble using, in Java, a regular expression that I already use in JavaScript. On a web page, you may have:

<b>Renewal Date:</b> 03 May 2010</td>

I just want to pull out the 03 May 2010, bearing in mind that the page contains much more than the content above. This is how I currently do it in JavaScript:

DateStr = /<b>Renewal Date:<\/b>(.+?)<\/td>/.exec(returnedHTMLPage);

I tried to follow some tutorials on java.util.regex.Pattern and java.util.regex.Matcher with no luck. I can't seem to translate (.+?) into something they can understand.

Thanks,

Noeneel

This is how regular expressions are used in Java:

Pattern p = Pattern.compile("<b>Renewal Date:</b>(.+?)</td>");
Matcher m = p.matcher(returnedHTMLPage);

if (m.find()) // find the next match (and "generate the groups")
    System.out.println(m.group(1)); // prints whatever the .+? expression matched.

There are other useful methods in the Matcher class, such as m.matches(). Have a look at the Matcher documentation.

On matches vs find

The problem is that you used matches when you should have used find. From the API:

  • The matches method attempts to match the entire input sequence against the pattern.
  • The find method scans the input sequence looking for the next subsequence that matches the pattern.

Note that String.matches(String regex) also requires a full match of the entire string. Unfortunately, String does not provide a partial regex match, but you can always use s.matches(".*pattern.*") instead.
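To make the difference concrete, here is a minimal sketch. The sample page and the names FindVsMatches, PAGE, DATE and extractDate are made up for illustration and are not from the question. It also shows one thing to watch for with the ".*pattern.*" workaround: by default . does not match line terminators, so on multi-line HTML the workaround needs the (?s) flag (DOTALL).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsMatches {
    // Hypothetical multi-line page used only for this demonstration.
    static final String PAGE =
        "<html><body>\n"
        + "<table><tr><td><b>Expiry Date:</b> 03 May 2010</td></tr></table>\n"
        + "</body></html>";

    static final Pattern DATE = Pattern.compile("<b>Expiry Date:</b>(.+?)</td>");

    // Returns the trimmed date text, or null when the pattern occurs nowhere in the input.
    static String extractDate(String html) {
        Matcher m = DATE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // matches() demands that the pattern cover the whole page, so it fails here.
        System.out.println(DATE.matcher(PAGE).matches()); // false

        // find() scans for the first subsequence that matches.
        System.out.println(extractDate(PAGE)); // 03 May 2010

        // Full-match workaround: without (?s), ".*" cannot cross the line breaks.
        System.out.println(PAGE.matches(".*<b>Expiry Date:</b>.*"));     // false
        System.out.println(PAGE.matches("(?s).*<b>Expiry Date:</b>.*")); // true
    }
}
```

If all you need is the captured group, find() is the simpler choice; the workaround only tells you whether the text is present.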


On reluctant quantifier

Java understands (.+?) perfectly.

Here's a demonstration: you're given a string s that consists of a string t repeated at least twice. Find t.

System.out.println("hahahaha".replaceAll("^(.+)\\1+$", "($1)"));
// prints "(haha)" -- greedy takes longest possible

System.out.println("hahahaha".replaceAll("^(.+?)\\1+$", "($1)"));
// prints "(ha)" -- reluctant takes shortest possible
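The same effect can be seen on input shaped like the one in the question. This is only a sketch: the ROW string with a second cell is invented for illustration, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical table row with two cells on one line.
    static final String ROW = "<b>Renewal Date:</b> 03 May 2010</td><td>other</td>";

    // Returns group 1 of the first match, or null if there is no match.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy: runs to the last </td> on the line.
        System.out.println("[" + capture("<b>Renewal Date:</b>(.+)</td>", ROW) + "]");
        // prints "[ 03 May 2010</td><td>other]"

        // Reluctant: stops at the first </td>.
        System.out.println("[" + capture("<b>Renewal Date:</b>(.+?)</td>", ROW) + "]");
        // prints "[ 03 May 2010]"
    }
}
```

Note the leading space in the captured text; call trim() on the result if you don't want it.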

On escaping metacharacters

It should also be said that you have unnecessarily injected \ into your regex ("\\" as a Java string literal).

        String regexDate = "<b>Expiry Date:<\\/b>(.+?)<\\/td>";
                                            ^^         ^^
        Pattern p2 = Pattern.compile("<b>Expiry Date:<\\/b>");
                                                      ^^

\ is used to escape regex metacharacters. A / is NOT a regex metacharacter.
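A quick way to convince yourself, as a rough sketch (the helper found and the "v1.0" inputs are invented for this example): the escaped and unescaped slash behave identically, whereas a real metacharacter such as . does change meaning when escaped. Pattern.quote is a standard way to treat an entire string as a literal.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True if the regex matches anywhere inside the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String html = "<b>Expiry Date:</b> 03 May 2010</td>";

        // Escaping / is harmless but pointless: both forms match the same text.
        System.out.println(found("<b>Expiry Date:<\\/b>", html)); // true
        System.out.println(found("<b>Expiry Date:</b>", html));   // true

        // The dot IS a metacharacter, so escaping it matters.
        System.out.println(found("v1.0", "v1x0"));   // true, . matches any character
        System.out.println(found("v1\\.0", "v1x0")); // false, \. matches only a literal dot

        // Pattern.quote escapes everything in the string for you.
        System.out.println(found(Pattern.quote("v1.0"), "v1x0")); // false
        System.out.println(found(Pattern.quote("v1.0"), "v1.0")); // true
    }
}
```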

OK, so using aioobe's original suggestion (which I had also tried earlier), I have:

String regexDate = "<b>Expiry Date:</b>(.+?)</td>";
Pattern p = Pattern.compile(regexDate);
Matcher m = p.matcher(returnedHTML);

if (m.matches()) // check if it matches (and "generate the groups")
{
  System.out.println("*******REGEX RESULT*******"); 
  System.out.println(m.group(1)); // prints whatever the .+? expression matched.
  System.out.println("*******REGEX RESULT*******"); 
}

The if condition must keep evaluating to false, because *******REGEX RESULT******* is never printed.

In case anyone missed what I am trying to achieve: I just want to get the date out. Somewhere in the HTML page is a date like <b>Expiry Date:</b> 03 May 2010</td>, and I want the 03 May 2010.

(.+?) is an odd choice. Try ( *[0-9]+ *[A-Za-z]+ *[0-9]+ *) or just ([^<]+) instead.
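Here is a rough sketch of both suggested alternatives, combined with find() as recommended in the other answer. The class name, constant names and the second sample string are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // "Anything up to the next tag": cannot run past a '<'.
    static final Pattern LOOSE =
        Pattern.compile("<b>Expiry Date:</b>([^<]+)</td>");

    // Stricter: digits, letters, digits, with optional spaces around each part.
    static final Pattern STRICT =
        Pattern.compile("<b>Expiry Date:</b>( *[0-9]+ *[A-Za-z]+ *[0-9]+ *)</td>");

    // Returns the trimmed capture, or null if the pattern is not found.
    static String extract(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String ok = "<td><b>Expiry Date:</b> 03 May 2010</td>";
        System.out.println(extract(LOOSE, ok));  // 03 May 2010
        System.out.println(extract(STRICT, ok)); // 03 May 2010

        // Hypothetical cell that contains markup instead of a plain date.
        String nested = "<td><b>Expiry Date:</b> <i>n/a</i></td>";
        System.out.println(extract(LOOSE, nested));  // null
        System.out.println(extract(STRICT, nested)); // null
    }
}
```

With (.+?) the second input would yield " <i>n/a</i>" as the "date"; the narrower patterns reject it instead, which is usually what you want when scraping.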
