Screen Scraping using JSoup

I want to scrape data from this website: http://myservices.ect.nl/tracing/objectstatus/Pages/Overview.aspx

[image]

I have used JSoup before on more static HTML sites, but this one is harder: the HTML table only appears after I click a button, and I don't know whether JSoup can be used to operate the button.

After clicking the button I get an HTML table, and I only want the rows where the modality is Barge.

Thank you for the tip to use Firefox. I now get the table, along with some other page information. Can you tell me how I can get only the table information? The output I get is as follows:

[image]

You will need Selenium for that, for example with its HtmlUnit driver; the example further down uses the Firefox driver.

Selenium Info

Maven/Download Binary JAR

HTML Unit Driver

Here is a full working example. It visits the website, selects Barge, clicks the button, and then reads the data from the page.

Edit: the example now gets only the table text.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.Select;

public class GetData {

    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://myservices.ect.nl/tracing/objectstatus/Pages/Overview.aspx");
            // crude fixed wait for the page to load
            Thread.sleep(5000);
            // select "Barge" in the modality drop-down
            new Select(driver.findElement(By.id("ctl00_ctl15_g_ce17bd4b_3803_47f6_822a_2b8dd10fc67d_ctl00_dlModality"))).selectByVisibleText("Barge");
            Thread.sleep(3000);
            // click the button that loads the table
            driver.findElement(By.className("button80")).click();
            Thread.sleep(5000);

            // get only the table text
            WebElement table = driver.findElement(By.className("grid-view"));
            String htmlTableText = table.getText();
            // these are the raw table values; process them as needed
            System.out.println(htmlTableText);
        } finally {
            // quit() closes every window and ends the session, so no separate close() is needed
            driver.quit();
        }
    }
}
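
If the raw text in htmlTableText still needs narrowing down, one option is to split it into lines and keep only the rows that mention the modality. The following is a minimal sketch in plain Java, not part of the original answer. It assumes that getText() returns one table row per line, which is typical but not guaranteed. The class name TableTextFilter and the sample rows are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TableTextFilter {

    // Returns the non-empty lines of tableText that contain keyword, ignoring case.
    static List<String> rowsContaining(String tableText, String keyword) {
        List<String> matches = new ArrayList<>();
        String needle = keyword.toLowerCase();
        // \R matches any line break (\n, \r\n, ...)
        for (String line : tableText.split("\\R")) {
            String row = line.trim();
            if (!row.isEmpty() && row.toLowerCase().contains(needle)) {
                matches.add(row);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample; the real text would be the htmlTableText value from the Selenium example.
        String sample = "Name Modality Status\nVessel A Barge Arrived\nVessel B Rail Departed\n";
        rowsContaining(sample, "Barge").forEach(System.out::println);
    }
}
```

Note that this is a plain substring match, so a row whose name or status happens to contain "Barge" would also be kept. Reading the individual table cells through Selenium would be more precise.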

Every "click" (or similar interaction) is a request to the server followed by a response to the browser. So a possible solution is to use JSoup not for the initial page but for the result page. For instance, send a POST to the page that returns the table, passing the parameter that selects the modality Barge. You can use a tool like Firebug (for Firefox) or Chrome Developer Tools to inspect the conversation (request/response), so that you can emulate it in your own code.
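
As a rough illustration of emulating such a request, the sketch below builds a form-encoded POST with the HTTP client classes that ship with Java 11. It is only a sketch under assumptions: the field names "modalityField" and "__VIEWSTATE" and their values are placeholders, and the real names and values have to be copied from the request shown in the browser's developer tools. ASP.NET pages like this one usually expect their hidden form fields to be sent back as well.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class FormPostSketch {

    // Encodes the fields as application/x-www-form-urlencoded, keeping the map's order.
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Builds (but does not send) a POST request carrying the encoded form fields.
    static HttpRequest buildPost(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Hypothetical field names and values; copy the real ones from the browser's network tab.
        fields.put("modalityField", "Barge");
        fields.put("__VIEWSTATE", "value-copied-from-the-page");

        HttpRequest request = buildPost(
                "http://myservices.ect.nl/tracing/objectstatus/Pages/Overview.aspx", fields);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(encodeForm(fields));
    }
}
```

Sending the request with java.net.http.HttpClient and passing the response body to Jsoup.parse would then give a document from which the table can be selected as on any static page.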

A browser emulator for Java may be useful for your problem; consider HtmlUnit.

It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, and so on, just as you do in your "normal" browser.

HTMLUnit
