java, regular expressions, & matcher

I've got a friend who had this working at one point in time. In learning regular expressions, I don't understand why it would have a / as the sandbox testers balk at it.

private static final Pattern SUB_URL_PATTERN = Pattern.compile("href=\"(/*\\w*/*\\w*/\\d+.html)\">",Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

What is the / in the above regex pattern trying to do? This pattern is broken and I'm not sure how to fix it.

This is how it comes out in the debugger:

href="(/*\w*/*\w*/\d+.html)">

Is this how the regex would break down?

href="     ... matches href="
/*         ... matches 0 or more occurrences of /   
\w*        ... matches 0 or more occurrences of word characters   
/*         ... matches 0 or more occurrences of /   
\w*        ... matches 0 or more occurrences of word characters   
/          ... matches a /  
\d+        ... matches one or several digits   
.html)">   ... matches any single character followed by html, then the literal ">

Here is the snippet of webpage source that it should be hitting on to capture href="/reo/4890530477.html":

<a href="/reo/4890530477.html" class="i" data-ids="0:00j0j_jDfSzBcGgid"></a>

final Pattern SUB_URL_PATTERN = Pattern.compile("href=\"/\\w+/\\w+/\\d+\\.html\"");

should match

href="/[word]/[word]/[number].html"

You might want:

final Pattern SUB_URL_PATTERN = Pattern.compile("href=\"(/\\w+)*/\\d+\\.html\"");

Which will match

href="[0+ groups of '/word']/[number].html"
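As a rough check, the sketch below runs the second pattern against the anchor tag quoted in the question. The class name HrefPatternDemo, the helper methods, and the extra capturing group around the whole path are illustrative assumptions, not part of the original code. It also runs the question's original pattern, which finds nothing on that tag because it requires "> right after .html, while the tag continues with " class=.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefPatternDemo {
    // Pattern from the question, unchanged: note the unescaped dot and the trailing ">.
    static final Pattern ORIGINAL_PATTERN = Pattern.compile(
            "href=\"(/*\\w*/*\\w*/\\d+.html)\">",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Second pattern from the answer; group 1 (added here) captures the whole path,
    // and the repeated "/word" part is made non-capturing.
    static final Pattern SUB_URL_PATTERN = Pattern.compile(
            "href=\"((?:/\\w+)*/\\d+\\.html)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns the captured path, or null when no matching href is found.
    static String extractSubUrl(String html) {
        Matcher m = SUB_URL_PATTERN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Reports whether the question's original pattern finds anything in the input.
    static boolean originalFinds(String html) {
        return ORIGINAL_PATTERN.matcher(html).find();
    }

    public static void main(String[] args) {
        String snippet =
                "<a href=\"/reo/4890530477.html\" class=\"i\" data-ids=\"0:00j0j_jDfSzBcGgid\"></a>";
        System.out.println(extractSubUrl(snippet)); // prints /reo/4890530477.html
        System.out.println(originalFinds(snippet)); // prints false
    }
}
```

If the closing quote is the only thing that should follow the path, ending the pattern at \" as above keeps it independent of whatever attributes come next in the tag.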

With Java, you need to use two backslashes \\ to make a string that contains a single backslash. For example, if you wanted a regex pattern of \d, you would need a string declared as "\\d", because the Java language uses the same escape character that regexes do.
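A small sketch of the escaping rule, assuming nothing beyond java.util.regex; the class name BackslashEscapeDemo and the helper isAllDigits are made up for illustration. It shows that the doubled backslash exists only in the Java source, and the regex engine receives a single one.

```java
import java.util.regex.Pattern;

class BackslashEscapeDemo {
    // In Java source, "\\d+" is the three-character string: backslash, d, plus.
    static final String DIGITS_REGEX = "\\d+";

    // True when the whole input consists of one or more digits.
    static boolean isAllDigits(String s) {
        return Pattern.matches(DIGITS_REGEX, s);
    }

    public static void main(String[] args) {
        System.out.println(DIGITS_REGEX);              // prints \d+
        System.out.println(DIGITS_REGEX.length());     // prints 3
        System.out.println(isAllDigits("4890530477")); // prints true
    }
}
```

This is also why the pattern looks different in the debugger than in the source file: the debugger shows the string's actual contents, after Java has processed the escapes.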

I highly recommend you take maybe an hour to go through the following free regex tutorial:

http://regexone.com/

It's interactive and a piece of cake to get through. When you finish I guarantee you'll understand them 100x better.

To second Jens, it's probably a better idea to use an HTML parser than to use regexes for this. You might check out jsoup; it's what I use.

http://jsoup.org/

The character / does not have any special meaning in the Java
regular expressions syntax/language. It is just that: the / literal.

The metacharacters supported by the Java RegExp API are: <([{\^-=$!|]})?*+.>

See here: http://docs.oracle.com/javase/tutorial/essential/regex/literals.html
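To illustrate the difference between a plain literal and a metacharacter, here is a minimal sketch; the class name SlashLiteralDemo and the sample strings are assumptions chosen for the demo. The / needs no escaping, while an unescaped . matches any character, which is why the question's .html is looser than intended.

```java
import java.util.regex.Pattern;

class SlashLiteralDemo {
    // Thin wrapper: true when the regex matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "/" is an ordinary literal and is written unescaped.
        System.out.println(matches("/reo/", "/reo/"));                    // prints true
        // An unescaped dot is a metacharacter and accepts any character.
        System.out.println(matches("\\d+.html", "123xhtml"));             // prints true
        // Escaping the dot restricts it to a literal ".".
        System.out.println(matches("\\d+\\.html", "123xhtml"));           // prints false
        // Pattern.quote treats every character of its argument literally.
        System.out.println(matches(Pattern.quote("1+1.html"), "1+1.html")); // prints true
    }
}
```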
