
Jsoup, html scraping

I'm trying to scrape a few pieces of HTML (see below). The HTML has a repeating div (here: class data). From each of these divs I want to extract the name, stat and stat2 values. I start with getElementsByClass, but how do I proceed from there? How do I get the three values separately?

This is what I have so far, but it prints all the text of each div together instead of the three pieces separately:

html.html

<html>
    <div class='data'>
        <a href='/url1'>
            <div class='name'>name1</div>
            <div class='stat'>123</div>
            <div class='stat2'>456</div>
        </a>
    </div>
    <div class='data'>
        <a href='/url2'>
            <div class='name'>name2</div>
            <div class='stat'>123.1</div>
            <div class='stat2'>456.2</div>
        </a>
    </div>
</html>

JsoupTesting.java

package JsoupTest;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTesting {

    public static void main(String[] args) throws IOException {

        File input = new File("html.html"); //path to html.html
        Document doc = Jsoup.parse(input, "UTF-8");

        Elements contents = doc.getElementsByClass("data");

        for (Element content : contents) {
            String text = content.text();
            System.out.println("name: " + text + "\n----");
        }

    }

}

Result:

name: name1 123 456
----
name: name2 123.1 456.2
----

I would like something like:

name: name1 
stat: 123 
stat2: 456
----
name: name2 
stat: 123.1 
stat2: 456.2
----

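Thanks to BackSlash's comment I got it to work. It wasn't hard, he just told me what to do :) Inside the loop, call getElementsByClass again on each data element, once per class, and take the first match: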

package JsoupTest;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTesting {

    public static void main(String[] args) throws IOException {

        File input = new File("html.html"); //path to html.html
        Document doc = Jsoup.parse(input, "UTF-8");

        Elements contents = doc.getElementsByClass("data");

        for (Element content : contents) {
            // Search only inside the current "data" div, one lookup per class.
            // first() returns null when nothing matches, so this assumes every
            // block contains all three child divs.
            String name = content.getElementsByClass("name").first().html();
            String stat = content.getElementsByClass("stat").first().html();
            String stat2 = content.getElementsByClass("stat2").first().html();
            System.out.println("name: " + name);
            System.out.println("stat: " + stat);
            System.out.println("stat2: " + stat2 + "\n----");
        }

    }

}
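
For comparison, here is a minimal sketch of the same extraction using jsoup's CSS selector methods (select and selectFirst) instead of getElementsByClass. It is not part of the original answer. The class name JsoupSelectSketch, the helper textOf, the method formatRows and the inlined SAMPLE_HTML constant are all made up for illustration, and it assumes a jsoup version that provides Element.selectFirst. The helper returns an empty string when a child div is missing, so a block without all three values does not throw a NullPointerException.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectSketch {

    // Inline copy of the two sample blocks from html.html (assumption: inlined
    // here only so the sketch runs without a file on disk).
    static final String SAMPLE_HTML =
            "<html>"
            + "<div class='data'><a href='/url1'>"
            + "<div class='name'>name1</div>"
            + "<div class='stat'>123</div>"
            + "<div class='stat2'>456</div>"
            + "</a></div>"
            + "<div class='data'><a href='/url2'>"
            + "<div class='name'>name2</div>"
            + "<div class='stat'>123.1</div>"
            + "<div class='stat2'>456.2</div>"
            + "</a></div>"
            + "</html>";

    // Hypothetical helper: text of the first descendant matching cssQuery,
    // or an empty string when there is no match.
    static String textOf(Element parent, String cssQuery) {
        Element match = parent.selectFirst(cssQuery);
        return match == null ? "" : match.text();
    }

    // Builds one formatted entry per <div class='data'> block.
    static List<String> formatRows(String html) {
        Document doc = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        for (Element data : doc.select("div.data")) {
            rows.add("name: " + textOf(data, ".name") + "\n"
                    + "stat: " + textOf(data, ".stat") + "\n"
                    + "stat2: " + textOf(data, ".stat2") + "\n----");
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : formatRows(SAMPLE_HTML)) {
            System.out.println(row);
        }
    }
}
```

The selector ".stat" matches only elements whose class list contains the token stat, so it does not pick up the stat2 div. text() is used here rather than html() because only the visible text is wanted. For these sample divs both give the same string.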

