Jsoup, HTML scraping

I'm trying to scrape a few pieces of HTML (see below). The HTML has a repeating div (here, class data). From each of these divs I want to scrape the name, stat and stat2, so I start with getElementsByClass. But how do I proceed from there? How do I get the three elements separately?

This is what I have so far, but it returns all the text of each div at once instead of the three pieces separately:

html.html

<html>
    <div class='data'>
        <a href='/url1'>
            <div class='name'>name1</div>
            <div class='stat'>123</div>
            <div class='stat2'>456</div>
        </a>
    </div>
    <div class='data'>
        <a href='/url2'>
            <div class='name'>name2</div>
            <div class='stat'>123.1</div>
            <div class='stat2'>456.2</div>
        </a>
    </div>
</html>

JsoupTesting.java

package JsoupTest;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTesting {

    public static void main(String[] args) throws IOException {

        File input = new File("html.html"); //path to html.html
        Document doc = Jsoup.parse(input, "UTF-8");

        Elements contents = doc.getElementsByClass("data");

        for (Element content : contents) {
            String text = content.text();
            System.out.println("name: " + text + "\n----");
        }

    }

}

Result:

name: name1 123 456
----
name: name2 123.1 456.2
----

I would like something like:

name: name1 
stat: 123 
stat2: 456
----
name: name2 
stat: 123.1 
stat2: 456.2
----

Thanks to BackSlash's comment I got it to work. It wasn't hard; he simply told me what to do :) Inside the loop, I call getElementsByClass again on each data element, once per inner class, and take the first match:

package JsoupTest;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTesting {

    public static void main(String[] args) throws IOException {

        File input = new File("html.html"); //path to html.html
        Document doc = Jsoup.parse(input, "UTF-8");

        Elements contents = doc.getElementsByClass("data");

        for (Element content : contents) {
            String name = content.getElementsByClass("name").first().html();
            String stat = content.getElementsByClass("stat").first().html();
            String stat2 = content.getElementsByClass("stat2").first().html();
            System.out.println("name: " + name);
            System.out.println("stat: " + stat);
            System.out.println("stat2: " + stat2 + "\n----");
        }

    }

}
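
For comparison, here is a minimal sketch of the same extraction written with jsoup's CSS selector methods (select and selectFirst) rather than getElementsByClass. It is not part of the original answer. It assumes a jsoup release recent enough to provide Element.selectFirst, parses the HTML from a string instead of html.html so that it is self-contained, and uses text() rather than html() so any nested markup would be reduced to plain text. The class name JsoupSelectSketch and the helpers extractRows and textOf are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupSelectSketch {

    // Builds one "name/stat/stat2" block per div.data element in the given HTML.
    static List<String> extractRows(String html) {
        Document doc = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        for (Element block : doc.select("div.data")) {
            rows.add("name: " + textOf(block, ".name") + "\n"
                    + "stat: " + textOf(block, ".stat") + "\n"
                    + "stat2: " + textOf(block, ".stat2"));
        }
        return rows;
    }

    // Hypothetical helper: text of the first match below parent,
    // or an empty string when nothing matches (avoids a NullPointerException).
    static String textOf(Element parent, String cssQuery) {
        Element match = parent.selectFirst(cssQuery);
        return match == null ? "" : match.text();
    }

    public static void main(String[] args) {
        // Same structure as html.html in the question, inlined as a string.
        String html = "<div class='data'><a href='/url1'>"
                + "<div class='name'>name1</div>"
                + "<div class='stat'>123</div>"
                + "<div class='stat2'>456</div>"
                + "</a></div>"
                + "<div class='data'><a href='/url2'>"
                + "<div class='name'>name2</div>"
                + "<div class='stat'>123.1</div>"
                + "<div class='stat2'>456.2</div>"
                + "</a></div>";
        for (String row : extractRows(html)) {
            System.out.println(row + "\n----");
        }
    }
}
```

The selector .stat matches only elements whose class list contains the token stat, so it does not pick up the stat2 divs. The null check in textOf matters when a data block is missing one of the three inner divs; the first() calls in the version above would throw in that case.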
