
Java Regular Expression: href without hash

I'm trying to build a sitemap by parsing HTML bodies for hrefs that don't contain a # (links with a hash are just sub-chapter anchors inside some content pages).

My regexp now:

<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>

I guess I should use [^#] or !# to exclude the # from hrefs, but I could not solve it by trial and error or by googling. Thanks in advance for helping me out!

Done it. I just added the # to the [^\"] block as well. :D

<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>
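To see what the modified pattern does, here is a minimal sketch using java.util.regex. The class name HrefFilter, the method extractHrefs and the sample HTML are made up for illustration; only the pattern string comes from the post above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class HrefFilter {
    // The pattern from the post: [^\"#]* rejects any href value that contains '#'.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>");

    // Returns the href values of all matching <a> tags, in document order.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1)); // group 1 is the href value
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Made-up sample input with one plain link and two links carrying a hash.
        String html = "<a href=\"intro.html\">Intro</a>"
                + "<a href=\"#chapter-2\">Chapter 2</a>"
                + "<a href=\"guide.html#setup\">Setup</a>";
        System.out.println(extractHrefs(html)); // prints [intro.html]
    }
}
```

Note that this drops every href containing a #, including ones like guide.html#setup, and it only matches double-quoted attribute values.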

You should not use regex to parse HTML.

It is best to use an HTML parser, e.g. http://jsoup.org, and then:

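import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// input is the HTML body as a String
Document doc = Jsoup.parse(input);
Elements links = doc.select("a[href]");

for (Element each : links) {
    // skip same-page anchors, i.e. hrefs that start with '#'
    if (each.attr("href").startsWith("#")) continue;
    ...
}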

So much more painless than using regex, eh!
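One thing to watch: the startsWith("#") check in the loop above skips only same-page anchors, whereas the regex in the question rejects any href that contains a #. A minimal sketch of the two checks, using only the standard library; the class and method names are hypothetical and chosen just for this illustration:

```java
// Hypothetical helper class contrasting the two filters discussed in this thread.
class HashLinks {
    // True for same-page anchors such as "#chapter-2" (the check used in the jsoup loop).
    static boolean isSamePageAnchor(String href) {
        return href.startsWith("#");
    }

    // True for any href that carries a '#', e.g. "guide.html#setup"
    // (what the [^\"#]* character class in the regex rejects).
    static boolean containsHash(String href) {
        return href.indexOf('#') >= 0;
    }

    public static void main(String[] args) {
        // Made-up sample hrefs.
        for (String href : new String[] {"intro.html", "#chapter-2", "guide.html#setup"}) {
            System.out.println(href + " anchor=" + isSamePageAnchor(href)
                    + " hash=" + containsHash(href));
        }
    }
}
```

Which check fits depends on whether links such as guide.html#setup should still count as pages in the sitemap.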
