Java regex: href without hash

Question

I'm trying to build a sitemap by parsing HTML bodies for hrefs that don't contain # (those with hashes are just sub-chapter links within some content pages).

My regexp now:

<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>

I guess I should use [^#] or !# to exclude the # from hrefs, but I could not solve it by trial and error or by googling. Thanks in advance for helping me out!

Answer 1

Done it. I just added the # to the [^\"] character class, so the captured href value can no longer contain one. :D

<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>
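A minimal sketch of how this pattern might be used from Java. The class name `HashlessHrefs`, the method `extract`, the sample HTML, and the `CASE_INSENSITIVE | DOTALL` flags are assumptions made for illustration and are not part of the original answer. Note that the pattern only recognizes double-quoted href values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashlessHrefs {
    // The pattern from the answer, written as a Java string literal.
    // The flags are an assumption added for this sketch.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the href value of every anchor whose href contains no '#'. */
    public static List<String> extract(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1)); // group 1 is the href value
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"page1.html\">One</a>"
                + "<a href=\"page1.html#sec2\">Section</a>"
                + "<a href=\"#top\">Top</a>"
                + "<a class=\"x\" href=\"page2.html\">Two</a>";
        System.out.println(extract(html)); // prints [page1.html, page2.html]
    }
}
```

Because `[^\"#]*` must be followed directly by the closing quote, any anchor whose href contains a # anywhere fails to match and is skipped.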

Answer 2

You should not use regex to parse HTML.

Better to use an HTML parser, e.g. http://jsoup.org, and then:

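import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.parse(input);       // input: e.g. the HTML source as a String
Elements links = doc.select("a[href]");  // every <a> that has an href attribute

for (Element each : links) {
    // Skip in-page links such as "#section"
    if (each.attr("href").startsWith("#")) continue;
    ...
}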

So much less painful than using regex, eh!
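The loop above skips only hrefs that start with #, whereas the regex in Answer 1 rejects a # anywhere in the value (for example `page1.html#sec2`). If that stricter behaviour is what is wanted, the href string could be checked for a fragment instead. The following is a sketch using only the standard library; the class `FragmentCheck` and method `hasFragment` are hypothetical names and not part of jsoup.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class FragmentCheck {
    /** Returns true if the href carries a fragment part ("#..."). */
    public static boolean hasFragment(String href) {
        try {
            // getRawFragment() is null when the URI has no '#' part
            return new URI(href).getRawFragment() != null;
        } catch (URISyntaxException e) {
            // Not a syntactically valid URI (e.g. contains a space):
            // fall back to a plain character search
            return href.indexOf('#') >= 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasFragment("page1.html"));      // false
        System.out.println(hasFragment("page1.html#sec2")); // true
        System.out.println(hasFragment("#top"));            // true
    }
}
```

Inside the jsoup loop, the condition would then read `if (FragmentCheck.hasFragment(each.attr("href"))) continue;`.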


Answer 1
1 accepted 2012-12-07 07:26:11

Answer 2
1 2012-12-07 07:26:14
