Java: How to use pattern.matcher

I am trying to parse an HTML page using jsoup. I check the content type of each linked URL and want to print every link whose type is not text/html. After getting the content type, I test it with pattern matching. With the code below, however, links of type text/html are still getting printed.

import java.io.*;
import java.net.*;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.apache.commons.validator.routines.UrlValidator;
import java.util.regex.Pattern;
import java.util.regex.Matcher;


public class HelloWorld {

    public static void main(String[] args) throws IOException { // get(), new URL() and openConnection() throw checked IOExceptions
    String url;
    UrlValidator urlValidator = new UrlValidator();
    url = "https://www.google.com";
    Document doc = Jsoup.connect(url).get(); //parse the html code pointed by url
    Elements links = doc.select("a[href]");
        for (Element link : links) {
            if(urlValidator.isValid(link.attr("href"))) { //check if the element is a url
                URL portfolio_url = new URL(link.attr("href"));
                URLConnection c = portfolio_url.openConnection();
                String link_type = c.getContentType();
                System.out.println(link_type);
                if(link_type != null) {
                    Pattern pattern = Pattern.compile(link_type, Pattern.CASE_INSENSITIVE);  // case-insensitive matching
                    Matcher matcher = pattern.matcher("text/html");
                    if(matcher.find() != true) {
                        System.out.println(link.attr("href"));
                    }
                }
            }
        }
    }
}
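
A likely cause of the behaviour described above: the two arguments are swapped. `Pattern.compile(link_type)` turns the header value into the pattern, and `matcher("text/html")` makes the fixed string the input. A server usually returns something like `text/html; charset=UTF-8`, and that longer string can never be found inside the shorter `text/html`, so `find()` returns false and the link is printed. Below is a minimal sketch of the check the other way round, assuming the goal is simply "does the header mention text/html". The class name `ContentTypeCheck` and the method `isHtml` are hypothetical names made up for this illustration, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names here are illustrative, not from the original post.
final class ContentTypeCheck {

    // The fixed string we are looking for is the pattern.
    // Pattern.quote makes sure "text/html" is treated as a literal.
    private static final Pattern HTML_TYPE =
            Pattern.compile(Pattern.quote("text/html"), Pattern.CASE_INSENSITIVE);

    // Returns true when the Content-Type header value mentions text/html.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // URLConnection.getContentType() returns null when the type is unknown
        }
        // The header value is the input that gets searched.
        Matcher matcher = HTML_TYPE.matcher(contentType);
        return matcher.find();
    }

    public static void main(String[] args) {
        String[] samples = { "text/html; charset=UTF-8", "TEXT/HTML", "image/png", null };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isHtml(sample));
        }
    }
}
```

In the loop from the question, the condition would then become something like `if (!ContentTypeCheck.isHtml(link_type))` before printing `link.attr("href")`. For a plain substring test a regex is not strictly needed; lower-casing the header and calling `contains("text/html")` would behave the same way.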

You can do this using the Linux utility Wget:

wget -r www.mytargetsite.com

Then run the command below, which lists every downloaded URL:

 find www.mytargetsite.com 

Here is some sample output:

$ wget -r www.blackorange.biz
$ find www.blackorange.biz/
www.blackorange.biz/
www.blackorange.biz/services.html
www.blackorange.biz/contact.html
www.blackorange.biz/images
www.blackorange.biz/images/projectimg1.jpg

Note: this also downloads all the pages for you.
