Jsoup, HTML scraping

Question

I'm trying to scrape a few pieces of HTML (see below). The HTML has a repeating div (here: class data). From each of these divs I want to extract the name, stat and stat2 values. I start with getElementsByClass, but how do I proceed from there? How do I get the three elements separately?

This is what I have so far, but it takes all the text at once instead of the three pieces separately:

html.html

<html>
    <div class='data'>
        <a href='/url1'>
            <div class='name'>name1</div>
            <div class='stat'>123</div>
            <div class='stat2'>456</div>
        </a>
    </div>
    <div class='data'>
        <a href='/url2'>
            <div class='name'>name2</div>
            <div class='stat'>123.1</div>
            <div class='stat2'>456.2</div>
        </a>
    </div>
</html>

JsoupTesting.java

package JsoupTest;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTesting {

    public static void main(String[] args) throws IOException {

        File input = new File("html.html"); //path to html.html
        Document doc = Jsoup.parse(input, "UTF-8");

        Elements contents = doc.getElementsByClass("data");

        for (Element content : contents) {
            String text = content.text();
            System.out.println("name: " + text + "\n----");
        }

    }

}

Result:

name: name1 123 456
----
name: name2 123.1 456.2
----

I would like output like this:

name: name1 
stat: 123 
stat2: 456
----
name: name2 
stat: 123.1 
stat2: 456.2
----

Answer 1

Thanks to BackSlash's comment I got it to work. It wasn't very hard; he just told me what to do :)

package JsoupTest;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTesting {

    public static void main(String[] args) throws IOException {

        File input = new File("html.html"); //path to html.html
        Document doc = Jsoup.parse(input, "UTF-8");

        Elements contents = doc.getElementsByClass("data");

        for (Element content : contents) {
            // Search inside each "data" div only, one class at a time.
            // Note: first() returns null if the class is missing in this div.
            String name = content.getElementsByClass("name").first().html();
            String stat = content.getElementsByClass("stat").first().html();
            String stat2 = content.getElementsByClass("stat2").first().html();
            System.out.println("name: " + name);
            System.out.println("stat: " + stat);
            System.out.println("stat2: " + stat2 + "\n----");
        }

    }

}
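
For comparison, here is a minimal sketch of the same extraction written with CSS selectors instead of getElementsByClass. It is not the code from the answer above. The class name JsoupSelectSketch and the helper extract are made up for illustration, the HTML is inlined as a string so the example does not depend on html.html, and it assumes a jsoup release recent enough to provide selectFirst. It uses text() rather than html(), which returns the same thing for plain values like these but would drop any nested markup.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectSketch {

    // Hypothetical helper: returns one formatted entry per div.data block.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        for (Element block : doc.select("div.data")) {
            // selectFirst returns null when nothing matches,
            // so incomplete blocks are skipped instead of throwing.
            Element name = block.selectFirst("div.name");
            Element stat = block.selectFirst("div.stat");
            Element stat2 = block.selectFirst("div.stat2");
            if (name == null || stat == null || stat2 == null) {
                continue;
            }
            rows.add("name: " + name.text()
                    + "\nstat: " + stat.text()
                    + "\nstat2: " + stat2.text());
        }
        return rows;
    }

    public static void main(String[] args) {
        // Same structure as html.html from the question, inlined.
        String html = "<html>"
                + "<div class='data'><a href='/url1'>"
                + "<div class='name'>name1</div>"
                + "<div class='stat'>123</div>"
                + "<div class='stat2'>456</div>"
                + "</a></div>"
                + "<div class='data'><a href='/url2'>"
                + "<div class='name'>name2</div>"
                + "<div class='stat'>123.1</div>"
                + "<div class='stat2'>456.2</div>"
                + "</a></div>"
                + "</html>";
        for (String row : extract(html)) {
            System.out.println(row + "\n----");
        }
    }
}
```

The selector div.stat matches only elements whose class list contains the exact token stat, so it does not pick up the stat2 div.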

Accepted answer, 2014-09-29 12:25:46