
Web Scraping with Jsoup only functioning half the time

I've been playing around with the Java Jsoup library lately in an attempt to get a better understanding of web scraping (pulling data off a website). But it would seem that the code I managed to put together only functions part of the time. Is the issue with my code, or is it possible that certain sites have measures to stop web scraping?

Here is the class that does all the 'magic':

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HTMLParser {

    private Document d;
    private String url;
    private String content;

    public HTMLParser(String url) {
        this.url = url;
        connect();
        parse();
        display();
    }

    // Fetch the page and parse it into a Document.
    private void connect() {
        try {
            d = Jsoup.connect(url).get();
        } catch (IOException e) {
            // The exception is silently ignored, so a failed request leaves d null.
        }
    }

    // Extract the visible text of the <body> element.
    private void parse() {
        content = d.body().text();
    }

    private void display() {
        System.out.println(content);
    }
}
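One way to narrow down whether a site is rejecting the request is to look at the HTTP status code before parsing anything. The following is only a sketch using the JDK's built-in `java.net.http` client (Java 11+) rather than Jsoup; the class name `FetchCheck`, the `explain` mapping, and the user-agent string are all assumptions made for illustration, and the mapping is a rough guess at likely causes, not a rule.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class FetchCheck {

    // Hypothetical, rough mapping from a status code to a likely cause.
    static String explain(int status) {
        if (status >= 200 && status < 300) return "ok";
        if (status >= 300 && status < 400) return "redirect";
        if (status == 403 || status == 429) return "refused or rate limited";
        if (status >= 400 && status < 500) return "client error";
        if (status >= 500 && status < 600) return "server error";
        return "unexpected status";
    }

    // Sends a GET request with an explicit User-Agent and returns the status code.
    static int fetchStatus(String url, String userAgent)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the URL to test; the user-agent value is an arbitrary example.
        int status = fetchStatus(args[0], "Mozilla/5.0 (compatible; example-scraper)");
        System.out.println(status + " -> " + explain(status));
    }
}
```

If the status is fine but the text you expect is still missing, the content is probably being filled in by JavaScript after the page loads, which a plain HTTP fetch will not see.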

You can use https://github.com/subes/invesdwin-webproxy, whose HtmlUnit-based headless browser support can wait for the page to render, load its data, run its JavaScript, and finish its Ajax calls before you actually parse it.

You might also have a problem if the site loads its data dynamically, which is common in this age of AJAX. Does Jsoup ignore robots.txt, or can you make it do so?

Ideally you need to render the page, and THEN scrape it.

This software apparently renders web pages: http://lobobrowser.org/java-browser.jsp. It also has an API, which might let you inspect the structure of the rendered page.

You can web scrape without Jsoup.

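import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

public class Trick {
    // Opening the connection can fail, so either catch IOException or declare it.
    public static void main(String[] args) throws IOException {
        URLConnection con = new URL("ANY URL").openConnection();

        // Any delimiter works here; "\\A" (start of input) is one choice that
        // makes next() return the whole response as a single string.
        String delimiter = "\\A";
        String str;
        try (Scanner scanner = new Scanner(con.getInputStream())) {
            scanner.useDelimiter(delimiter);
            str = scanner.next();
        }

        // Cut off everything before the start marker. skip is how many characters
        // to drop counting from the left of the marker (0 is a placeholder value).
        int skip = 0;
        str = str.substring(str.indexOf("NAME OF CLASS OR ID") + skip);

        // Keep everything up to the end marker. indexOf returns -1 when a marker
        // is missing, which makes substring throw.
        String end = "WHERE YOU WANT IT TO END OR STOP SCRAPING";
        String wow = str.substring(0, str.indexOf(end));
        System.out.println(wow);

        // Advance past the extracted part so the next search starts after it.
        str = str.substring(str.indexOf(end));
    }
}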
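The substring approach above breaks as soon as a marker is missing, because `indexOf` returns -1. A slightly safer variant could look like the sketch below; the `MarkerExtractor` class, the `between` method, and the sample HTML are made up for illustration and are not part of any library.

```java
import java.util.Optional;

final class MarkerExtractor {

    // Returns the text between the first startMarker and the next endMarker,
    // or Optional.empty() if either marker is not found.
    static Optional<String> between(String text, String startMarker, String endMarker) {
        int start = text.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty();
        }
        start += startMarker.length();
        int end = text.indexOf(endMarker, start);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(text.substring(start, end));
    }

    public static void main(String[] args) {
        // Sample input; in practice this would be the downloaded page source.
        String html = "<div class=\"price\">42.00</div>";
        System.out.println(
                between(html, "class=\"price\">", "</div>").orElse("(not found)"));
    }
}
```

String markers like these are fragile when the page layout changes, so this only suits quick experiments on pages whose markup you control or know well.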

