A friend of mine had this working at one point. I'm still learning regular expressions, and I don't understand why the pattern contains a /, since the sandbox testers balk at it.
private static final Pattern SUB_URL_PATTERN = Pattern.compile("href=\"(/*\\w*/*\\w*/\\d+.html)\">",Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
What is the / in the above regex pattern trying to do? This pattern is broken and I'm not sure how to fix it.
This is how it comes out in the debugger:
href="(/*\w*/*\w*/\d+.html)">
Is this how the regex would break down?
href=" ... matches href="
/* ... matches 0 or more occurrences of /
\w* ... matches 0 or more occurrences of word characters
/* ... matches 0 or more occurrences of /
\w* ... matches 0 or more occurrences of word characters
/ ... matches a /
\d+ ... matches one or several digits
.html)"> ... matches any single character, then html, then "> (the ) closes the capture group)
Here is the snippet of webpage source that it should be hitting on to capture href="/reo/4890530477.html":
<a href="/reo/4890530477.html" class="i" data-ids="0:00j0j_jDfSzBcGgid"></a>
final Pattern SUB_URL_PATTERN = Pattern.compile("href=\"/\\w+/\\w+/\\d+\\.html\"");
should match
href="/[word]/[word]/[number].html"
Since the sample link has only one word segment (/reo) before the number, you might want:
final Pattern SUB_URL_PATTERN = Pattern.compile("href=\"(/\\w+)*/\\d+\\.html\"");
Which will match
href="[0+ groups of '/word']/[number].html"
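To check these patterns against the sample line from the question, a minimal sketch might look like the following. The class name SubUrlDemo, the extract helper, and the extra outer capturing group in SUGGESTED are assumptions added for illustration and are not part of the original post. Run this way, the original pattern appears to fail on the sample because it demands "> immediately after .html, while the sample has more attributes before the closing >.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubUrlDemo {
    // Sample anchor tag copied from the question.
    static final String SAMPLE =
        "<a href=\"/reo/4890530477.html\" class=\"i\" data-ids=\"0:00j0j_jDfSzBcGgid\"></a>";

    // Original pattern from the question: it requires "> right after .html.
    static final Pattern ORIGINAL = Pattern.compile(
        "href=\"(/*\\w*/*\\w*/\\d+.html)\">",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Second suggested pattern, with an outer capturing group added (assumption)
    // so that group(1) holds the whole path rather than only the last "/word".
    static final Pattern SUGGESTED = Pattern.compile(
        "href=\"((?:/\\w+)*/\\d+\\.html)\"",
        Pattern.CASE_INSENSITIVE);

    // Returns the text captured by group 1, or null when the pattern is not found.
    static String extract(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract(ORIGINAL, SAMPLE));  // prints null
        System.out.println(extract(SUGGESTED, SAMPLE)); // prints /reo/4890530477.html
    }
}
```

Using find() rather than matches() matters here, because the pattern only describes the href attribute and not the whole line.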
With Java, you need to write two backslashes (\\) in a string literal to get a string that contains one backslash. For example, if you wanted a regex pattern of \d, you would need a string declared as "\\d", because the Java language uses the same escape character that regexes do.
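As a small illustration of the escaping rule, the sketch below shows what a doubled backslash in Java source turns into at run time. The class name EscapeDemo and the isAllDigits helper are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // In Java source this literal is written with two backslashes,
    // but the resulting string holds the three characters \d+
    static final String DIGITS_REGEX = "\\d+";

    // True when the whole input consists of one or more digits.
    static boolean isAllDigits(String s) {
        return Pattern.matches(DIGITS_REGEX, s);
    }

    public static void main(String[] args) {
        System.out.println(DIGITS_REGEX);              // prints \d+
        System.out.println(DIGITS_REGEX.length());     // prints 3
        System.out.println(isAllDigits("4890530477")); // prints true
        System.out.println(isAllDigits("reo"));        // prints false
    }
}
```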
I highly recommend you take maybe an hour to go through the following free regex tutorial:
It's interactive and a piece of cake to get through. When you finish I guarantee you'll understand them 100x better.
To second Jens, it's probably a better idea to use an HTML parser than to use regexes for this. You might check out jsoup; it's what I use.
The character / does not have any special meaning in the Java regular expression syntax. It is just that: the literal /.
The metacharacters supported by the Java regex API are:
<([{\^-=$!|]})?*+.>
See here: http://docs.oracle.com/javase/tutorial/essential/regex/literals.html
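A quick way to see the difference between a literal such as / and a metacharacter such as . is sketched below. The class name SlashDemo, the matches helper, and the short test inputs are made up for this example.

```java
import java.util.regex.Pattern;

class SlashDemo {
    // True when the regex matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // / is an ordinary character and matches itself.
        System.out.println(matches("/reo/\\d+", "/reo/42"));     // prints true
        // Escaping / is accepted by Java but is not needed.
        System.out.println(matches("\\/reo\\/\\d+", "/reo/42")); // prints true
        // An unescaped . is a metacharacter and matches any character.
        System.out.println(matches("\\d+.html", "42xhtml"));     // prints true
        // An escaped . matches only a literal dot.
        System.out.println(matches("\\d+\\.html", "42xhtml"));   // prints false
        System.out.println(matches("\\d+\\.html", "42.html"));   // prints true
    }
}
```

This is why the .html in the original pattern is looser than it looks, and why the suggested patterns write \\.html instead.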