Java jsoup link ignore
I have the following code:
private static final Pattern FILE_FILTER = Pattern.compile(
        ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|avi|mov|mpeg|ram|m4v|pdf" +
        "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

private boolean isRelevant(String url) {
    if (url.length() < 1) // Remove empty urls
        return false;
    else if (FILE_FILTER.matcher(url).matches()) {
        return false;
    }
    else
        return TLSpecific.isRelevant(url);
}
I use this when parsing a web site to check whether it contains links that match any of the declared patterns, but I don't know whether there is a way to do this directly through jsoup and optimize the code. For example, given a web page, how can I ignore all of them with jsoup?
How can I ignore all of them with jsoup?
Let's say we want any element whose href or src attribute does not end with a jpg or jpeg extension.
String filteredLinksCssQuery = "[href]:not([href~=(?i)\\.jpe?g$]), " + //
                               "[src]:not([src~=(?i)\\.jpe?g$])";

String html = "<a href='foo.jpg'>foo</a>" + //
              "<a href='bar.svg'>bar</a>" + //
              "<script src='baz.js'></script>";

Document doc = Jsoup.parse(html);

for (Element e : doc.select(filteredLinksCssQuery)) {
    System.out.println(e);
}
Output:
<a href="bar.svg">bar</a>
<script src="baz.js"></script>
The query breaks down as follows:
[href] /* Select any element having an href attribute... */
:not([href~=(?i)\.jpe?g$]) /* ... but exclude those matching the regex (?i)\.jpe?g$ */
, /* OR */
[src] /* Select any element having a src attribute... */
:not([src~=(?i)\.jpe?g$]) /* ... but exclude those matching the regex (?i)\.jpe?g$ */
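To see which attribute values the regex in that query rejects, it can be tried on its own with java.util.regex. This is only a sketch: it assumes jsoup applies the `[attr~=regex]` pattern with "find" semantics (a match anywhere in the value), which is why the `$` anchor is needed and no leading `.*` is required. The class and method names here are made up for illustration.

```java
import java.util.regex.Pattern;

public class JpegRegexCheck {
    // Same regex as in the CSS query: case-insensitive ".jpg" or ".jpeg" at the end.
    private static final Pattern JPEG = Pattern.compile("(?i)\\.jpe?g$");

    // Hypothetical helper: true when the attribute value would be excluded.
    public static boolean endsWithJpeg(String value) {
        return JPEG.matcher(value).find();
    }

    public static void main(String[] args) {
        String[] values = {"foo.jpg", "FOO.JPEG", "bar.svg", "baz.js", "foo.jpg?size=2"};
        for (String v : values) {
            System.out.println(v + " -> " + endsWithJpeg(v));
        }
    }
}
```

Note that a value with a query string such as foo.jpg?size=2 is not caught, because the extension is no longer at the end of the value.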
You can add more extensions to filter. You may want to write some code that generates filteredLinksCssQuery automatically, because this CSS query can quickly become unmaintainable.
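One possible way to generate the query is sketched below. It is not part of jsoup: the class name FilteredLinkQuery and the method build are hypothetical, and the extensions are passed as regex fragments (so "jpe?g" and "tiff?" from the question's FILE_FILTER can be reused directly). It folds all extensions into a single alternation per attribute instead of one :not(...) per extension; this assumes jsoup accepts parentheses and pipes inside the attribute regex.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilteredLinkQuery {

    // Builds "[attr]:not([attr~=(?i)\.(?:ext1|ext2)$])" for each attribute,
    // joined with ", " (the CSS "or" combinator).
    public static String build(List<String> attributes, List<String> extensionPatterns) {
        String regex = "(?i)\\.(?:" + String.join("|", extensionPatterns) + ")$";
        return attributes.stream()
                .map(attr -> "[" + attr + "]:not([" + attr + "~=" + regex + "])")
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        String query = build(List.of("href", "src"), List.of("jpe?g", "png", "gif"));
        System.out.println(query);
    }
}
```

The resulting string can be passed to doc.select(...) in place of the hand-written filteredLinksCssQuery.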