Java Regular Expression: href without hash
I'm trying to build a sitemap by parsing HTML bodies for hrefs that don't contain # (those with hashes are just sub-chapter links within some content pages).
My regexp so far, written as a Java string literal:
<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>
I guess I should use [^#] or !# to exclude the # from hrefs, but I could not solve it by trial and error or by googling. Thanks in advance for helping me out!
Done it. I just added # to the [^\"] character class as well. :D
<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>
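To see what the pattern above actually accepts and rejects, here is a minimal sketch using java.util.regex. The class name HrefWithoutHash, the method extractLinks, and the sample HTML are made up for illustration and are not from the original post. Because # is excluded from the captured group, any href containing a hash fails to match, whether the hash is at the start or in the middle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class HrefWithoutHash {
    // The pattern from the answer above, as a Java string literal.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>");

    // Returns the href values of all anchors whose href contains no '#'.
    public static List<String> extractLinks(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1)); // group 1 is the href value
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Made-up sample input: one plain link, two links with a hash.
        String html = "<a href=\"page.html\">Page</a> "
                + "<a href=\"#chapter2\">Chapter 2</a> "
                + "<a href=\"other.html#top\">Top</a>";
        System.out.println(extractLinks(html)); // prints [page.html]
    }
}
```

Note that this also drops links such as other.html#top, and that the pattern only handles double-quoted href values; whether that is acceptable depends on the pages being crawled.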
You should not use regex to parse HTML.
It is better to use an HTML parser, e.g. jsoup (http://jsoup.org), and then:
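import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.parse(input);
Elements links = doc.select("a[href]"); // all anchors that have an href attribute
for (Element each : links) {
    // skip links that are only in-page anchors
    if (each.attr("href").startsWith("#")) continue;
    ...
}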
So much more painless than using regex, eh!