I am trying to parse an HTML page using jsoup. I check the content type of each link on the page and want to print every link whose type is not text/html. I use pattern matching after getting the content type of the link. With the code below, however, links of type text/html are printed as well:
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.apache.commons.validator.routines.UrlValidator;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class HelloWorld {
    public static void main(String[] args) throws IOException {
        String url;
        UrlValidator urlValidator = new UrlValidator();
        url = "https://www.google.com";
        Document doc = Jsoup.connect(url).get(); // fetch and parse the HTML page at url
        Elements links = doc.select("a[href]"); // all anchors that have an href attribute
        for (Element link : links) {
            if (urlValidator.isValid(link.attr("href"))) { // check that the href is a valid URL
                URL portfolio_url = new URL(link.attr("href"));
                URLConnection c = portfolio_url.openConnection();
                String link_type = c.getContentType(); // Content-Type header, may be null
                System.out.println(link_type);
                if (link_type != null) {
                    // the content type is compiled as the pattern (case-insensitive) ...
                    Pattern pattern = Pattern.compile(link_type, Pattern.CASE_INSENSITIVE);
                    // ... and searched for inside the literal "text/html"
                    Matcher matcher = pattern.matcher("text/html");
                    if (!matcher.find()) {
                        System.out.println(link.attr("href"));
                    }
                }
            }
        }
    }
}
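A likely cause, offered as a hedged sketch rather than a confirmed diagnosis: the code compiles the returned content type as the pattern and searches for it inside the literal "text/html". Servers commonly return values such as "text/html; charset=UTF-8", which cannot be found inside the shorter string "text/html", so find() returns false and the link is printed. Comparing only the media type, with any parameters stripped, avoids this. The class name ContentTypeCheck and the method isHtml below are hypothetical names chosen for illustration, and treating a null content type as "not HTML" is an assumption (the code above skips such links instead).

```java
import java.util.Locale;

public class ContentTypeCheck {
    // Returns true when a Content-Type header value denotes an HTML page.
    // Parameters such as "; charset=UTF-8" are ignored, and case is ignored.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // assumption: an unknown type is treated as non-HTML
        }
        // Keep only the part before the first ';', e.g. "text/html".
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html");
    }

    public static void main(String[] args) {
        // Sample header values (illustrative, not captured from a real server).
        String[] samples = {
            "text/html; charset=UTF-8", "TEXT/HTML", "image/png", "application/pdf"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + (isHtml(sample) ? "HTML" : "not HTML"));
        }
    }
}
```

In the loop above, the inner check could then become `if (!ContentTypeCheck.isHtml(link_type))` before printing the href.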
You can do this using the Linux utility Wget:
wget -r www.mytargetsite.com
Then run the command below, which lists every file that was downloaded:
find www.mytargetsite.com
Here's the sample output:
$ wget -r www.blackorange.biz
$ find www.blackorange.biz/
www.blackorange.biz/
www.blackorange.biz/services.html
www.blackorange.biz/contact.html
www.blackorange.biz/images
www.blackorange.biz/images/projectimg1.jpg
Note: this also downloads all the pages for you.
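To tie this back to the original question, the mirrored directory can also be filtered from Java so that only non-HTML files are listed. This is a minimal sketch under stated assumptions: it judges files by their extension (.html or .htm) rather than by any Content-Type header, and the class name MirrorScan, the method nonHtmlFiles, and the default directory name are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MirrorScan {
    // Lists regular files under root whose names do not end in .html or .htm.
    static List<Path> nonHtmlFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile) // skip directories
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                        return !name.endsWith(".html") && !name.endsWith(".htm");
                    })
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the directory created by wget -r is passed as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "www.mytargetsite.com");
        for (Path file : nonHtmlFiles(root)) {
            System.out.println(file);
        }
    }
}
```

Run against the sample output above, this would skip services.html and contact.html and list files such as images/projectimg1.jpg.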