Link extraction using owasp-java-html-sanitizer

I'm planning to use owasp-java-html-sanitizer to perform a few tasks on user-generated HTML.

I'd like to extract a list of the URLs from the HTML string.

I would also like to make sure all links have their target set to "_blank". This seems similar to what the HtmlPolicyBuilder.requireRelNofollowOnLinks configuration does. (done)

PolicyFactory linkRewrite = new HtmlPolicyBuilder().allowAttributes("href").onElements("a")
      .requireRelNofollowOnLinks().allowElements(new ElementPolicy() {
        public String apply(String elementName, List<String> attrs) {
          // attrs is a flat list of alternating attribute names and values.
          attrs.add("target");
          attrs.add("_blank");
          return "a";
        }
      }, "a").toFactory();

This adds target="_blank" to links, but I'm not sure it's the best way to accomplish it.
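
One possible refinement, offered only as a sketch: because the attribute list is a flat sequence of alternating names and values, appending blindly could produce two target entries if target were ever allowed through as an attribute as well. The helper below is hypothetical (the FlatAttrs class and its set method are not part of owasp-java-html-sanitizer) and works on a plain List<String>, so it runs without the library on the classpath. It overwrites an existing value and appends only when the name is absent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of owasp-java-html-sanitizer.
final class FlatAttrs {
  private FlatAttrs() {}

  // Sets name=value in a flat [name0, value0, name1, value1, ...] list,
  // overwriting the first existing entry instead of appending a duplicate.
  static void set(List<String> attrs, String name, String value) {
    for (int i = 0, n = attrs.size(); i + 1 < n; i += 2) {
      if (name.equals(attrs.get(i))) {
        attrs.set(i + 1, value);
        return;
      }
    }
    attrs.add(name);
    attrs.add(value);
  }

  public static void main(String[] args) {
    List<String> attrs =
        new ArrayList<>(Arrays.asList("href", "http://example.com/"));
    set(attrs, "target", "_blank");
    set(attrs, "target", "_blank"); // second call does not add a duplicate
    System.out.println(attrs); // [href, http://example.com/, target, _blank]
  }
}
```

Inside an ElementPolicy, the two attrs.add(...) calls could then be replaced with FlatAttrs.set(attrs, "target", "_blank"), assuming the list handed to apply is mutable, which the original snippet already relies on.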

This also extracts the URLs:

.allowElements(new ElementPolicy() {
        public String apply(String elementName, List<String> attrs) {
          // urls is a collection declared outside the policy.
          for (int i = 0, n = attrs.size(); i < n; i += 2) {
            if ("href".equals(attrs.get(i))) {
              urls.add(attrs.get(i + 1));
            }
          }
          return elementName;
        }
      }, "a")

Both steps can be combined in a single policy:

   .allowElements(
   new ElementPolicy() {
     public String apply(String elementName, List<String> attrs) {
       // Make sure that all links open in new windows/tabs without
       // using <base target> which also affects unsanitized links.
       attrs.add("target");
       attrs.add("_blank");
       return elementName;
     }
   }, "a")
   .allowAttributes("href").matching(
   new AttributePolicy() {
     public String apply(String elementName, String attributeName, String value) {
       // Collect all link URLs.
       urls.add(value);
       return value;
     }
   }).onElements("a")
