
Java, regular expressions, & Matcher

A friend of mine had this working at one point. While learning regular expressions, I don't understand why the pattern contains a /, since the sandbox testers balk at it.

private static final Pattern SUB_URL_PATTERN = Pattern.compile("href=\"(/*\\w*/*\\w*/\\d+.html)\">",Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

What is the / in the above regex pattern trying to do? This pattern is broken and I'm not sure how to fix it.

This is how it comes out in the debugger:

href="(/*\w*/*\w*/\d+.html)">

Is this how the regex would break down?

href="     ... matches href="
/*         ... matches 0 or more occurrences of /   
\w*        ... matches 0 or more occurrences of word characters   
/*         ... matches 0 or more occurrences of /   
\w*        ... matches 0 or more occurrences of word characters   
/          ... matches a /  
\d+        ... matches one or several digits   
.html)">   ... matches any one character (the . is unescaped) followed by html, closes the capture group, then matches ">

Here is the snippet of web page source that it should be hitting on to capture href="/reo/4890530477.html":

<a href="/reo/4890530477.html" class="i" data-ids="0:00j0j_jDfSzBcGgid"></a>

final Pattern SUB_URL_PATTERN = Pattern.compile("href=\"/\\w+/\\w+/\\d+\\.html\"");

should match

href="/[word]/[word]/[number].html"

You might want:

final Pattern SUB_URL_PATTERN = Pattern.compile("href=\"(/\\w+)*/\\d+\\.html\"");

which will match

href="[0+ groups of '/word']/[number].html"
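To check these patterns against the snippet from the question, here is a minimal sketch. The class name HrefPatternDemo, the helper firstPath, and the extra capturing group wrapped around each path are my own additions for illustration, not part of the original code. It suggests that the question's pattern fails on this snippet because it demands "> right after .html, while the tag continues with a class attribute. It also shows that the two-segment pattern expects one more /word than /reo/4890530477.html has.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefPatternDemo {
    // The anchor tag quoted in the question.
    static final String HTML =
        "<a href=\"/reo/4890530477.html\" class=\"i\" data-ids=\"0:00j0j_jDfSzBcGgid\"></a>";

    // The question's pattern, unchanged: it requires "> immediately after .html.
    static final Pattern ORIGINAL = Pattern.compile(
        "href=\"(/*\\w*/*\\w*/\\d+.html)\">", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // First pattern above, with a capturing group added (assumption) so group(1) is the path.
    static final Pattern TWO_SEGMENTS = Pattern.compile("href=\"(/\\w+/\\w+/\\d+\\.html)\"");

    // Second pattern above, with an outer capturing group added (assumption);
    // the repeated part becomes a non-capturing group.
    static final Pattern ANY_SEGMENTS = Pattern.compile("href=\"((?:/\\w+)*/\\d+\\.html)\"");

    // Returns the first captured path, or null if the pattern finds nothing.
    static String firstPath(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstPath(ORIGINAL, HTML));      // null
        System.out.println(firstPath(TWO_SEGMENTS, HTML));  // null
        System.out.println(firstPath(ANY_SEGMENTS, HTML));  // /reo/4890530477.html
    }
}
```

For this particular page, the second pattern looks like the one to use; the first would only hit links such as /a/b/12.html.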

With Java, you need to write two backslashes (\\) in a string literal to get a string that contains one backslash. For example, if you wanted a regex pattern of \d, you would need a string declared as "\\d", because the Java language uses the same escape character that regexes do.
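A small sketch of that double escaping, using only java.util.regex.Pattern; the class and constant names are made up for illustration. It also shows why the unescaped . in the original pattern is looser than intended.

```java
import java.util.regex.Pattern;

class BackslashEscapeDemo {
    // The Java literal "\\d" is the two-character string \d, which the regex engine reads as "one digit".
    static final String DIGIT_REGEX = "\\d";

    // The Java literal "\\." is the regex \. (a literal dot); a bare "." matches any character.
    static final String LITERAL_DOT_REGEX = "\\.";

    public static void main(String[] args) {
        System.out.println(DIGIT_REGEX.length());                                        // 2
        System.out.println(Pattern.matches(DIGIT_REGEX, "7"));                           // true
        System.out.println(Pattern.matches("1.html", "1xhtml"));                         // true
        System.out.println(Pattern.matches("1" + LITERAL_DOT_REGEX + "html", "1xhtml")); // false
    }
}
```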

I highly recommend you take maybe an hour to go through the following free regex tutorial:

http://regexone.com/

It's interactive and a piece of cake to get through. When you finish, I guarantee you'll understand them 100x better.

To second Jens, it's probably a better idea to use an HTML parser than to use regexes for this. You might check out jsoup; it's what I use.

http://jsoup.org/

The character / does not have any special meaning in Java's regular expression syntax. It is just that: the literal /.

The metacharacters supported by the Java regex API are: <([{\^-=$!|]})?*+.>

See here: http://docs.oracle.com/javase/tutorial/essential/regex/literals.html
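As a quick illustration of the difference, the sketch below contrasts / with the metacharacter . and uses Pattern.quote to treat a whole string literally. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class SlashLiteralDemo {
    // "/" needs no escaping in a Java regex: it simply matches itself.
    static boolean slashMatchesItself() {
        return Pattern.matches("/", "/");
    }

    // Pattern.quote makes every character in the text literal, metacharacters included.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        System.out.println(slashMatchesItself());            // true
        System.out.println(Pattern.matches("a.b", "a/b"));   // true: "." matches any character
        System.out.println(matchesLiterally("a.b", "a/b"));  // false: the dot is now literal
        System.out.println(matchesLiterally("a.b", "a.b"));  // true
    }
}
```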
