
How can I scrape the HTML data which I want in Java?

I'm practicing scraping data from websites. I'm stuck on a site whose URL is https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=1 . I want to get the Kurum - İlan Numarası - Şehir (Corporation - Notice Number - City) data. I think I can't scrape the div. When I run code that uses the selector div.search-results-header row, it doesn't work. I also want to get the first 20 pages of this site. How can I do this? The page source is a complicated bunch of code, so I'm adding images as attachments. If you tell me at least how to get Kurum, I think I can handle the others. Thank you.

However, this is the code I'm working on for the project.

public static void main(String[] args) throws Exception {

    File iflasHukuku = new File("/Users/Berkan/Desktop/Iflas Hukuku.txt");
    iflasHukuku.createNewFile();

    FileWriter fileWriter = new FileWriter(iflasHukuku);
    BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);

    final Document document = Jsoup.connect("https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=1").get();


    for(Element x: document.select(".search-results-table-container container mb-4 ng-tns-c6-3 ng-star-inserted")) {

        final String kurumAdi = x.select("div.search-results-header row").text();
        System.out.println(kurumAdi);

    }

    }

It appears the webpage is an Angular app. So you cannot simply grab the HTML content using Jsoup.connect, because the browser needs to execute the JavaScript to render the page. Instead, use WebDriver to load the content, get the page source, and pass that to Jsoup.

See this:

import io.github.bonigarcia.wdm.WebDriverManager;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class JSoupTest {

    public static void main(String[] args) {
        WebDriverManager.chromedriver().setup(); //downloads the driver

        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.setHeadless(true);

        WebDriver driver = new ChromeDriver(chromeOptions);
        driver.get("https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=1");

        WebDriverWait wait = new WebDriverWait(driver, 30);
        wait.until(webDriver -> driver.getPageSource().contains("İlan Açıklaması"));

        final Document document = Jsoup.parse(driver.getPageSource());
        driver.quit(); // the rendered HTML is captured, so the browser can be closed

        // each matching element is one row of the search results
        Elements rows = document.select(".search-results-row");

        for (Element x : rows) {

            System.out.println(x.text());
            //parse it further
        }

    }


}
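
The question also asks for the first 20 pages. The URL suggests that the currentPage query parameter selects the page, though that is an assumption and not something confirmed here. Under that assumption, a minimal sketch for building the page URLs could look like this (PageUrls and firstPages are hypothetical names, not part of the code above):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class PageUrls {

    // Listing URL from the question, without the page number.
    static final String BASE =
            "https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=";

    // Builds the URLs for pages 1..pageCount.
    // Assumption: currentPage is 1-based and is the only parameter that changes.
    static List<String> firstPages(int pageCount) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            urls.add(BASE + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : firstPages(20)) {
            System.out.println(url);
        }
    }
}
```

Each URL could then be passed to driver.get(...) in a loop, repeating the same wait and Jsoup.parse steps for every page and reusing a single driver instance.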

Required Dependencies:

        <dependency>
            <groupId>io.github.bonigarcia</groupId>
            <artifactId>webdrivermanager</artifactId>
            <version>4.2.2</version>
        </dependency>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-chrome-driver</artifactId>
            <version>3.141.59</version>
        </dependency>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-support</artifactId>
            <version>3.141.59</version>
        </dependency>

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>28.2-jre</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.13.1</version>
        </dependency>
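
A side note on the selectors in the question: even on fully rendered HTML they would probably not match. In a CSS selector a space is the descendant combinator, so div.search-results-header row looks for a <row> element inside that div. To match one element that carries several classes, the classes are chained with dots and no spaces. The sketch below (SelectorUtil and classSelector are hypothetical names) converts a copied class attribute into that form:

```java
// Hypothetical helper, not part of the original answer.
class SelectorUtil {

    // Turns a class attribute value such as "search-results-header row"
    // into a compound CSS class selector such as ".search-results-header.row".
    static String classSelector(String classAttribute) {
        StringBuilder selector = new StringBuilder();
        for (String name : classAttribute.trim().split("\\s+")) {
            if (!name.isEmpty()) {
                selector.append('.').append(name);
            }
        }
        return selector.toString();
    }

    public static void main(String[] args) {
        // prints "div.search-results-header.row"
        System.out.println("div" + classSelector("search-results-header row"));
    }
}
```

Classes such as ng-tns-c6-3 and ng-star-inserted appear to be generated by Angular and may change between builds, so it is probably safer to leave them out and select only on the stable names, for example .search-results-table-container.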

