How to get data from page source using Java and Jsoup

Question

How can I get the values $23,000,000 and $47,351,251 from the following page source? I want only these two values, but I'm not sure of the best way to extract them.

    <div class="txt-block">
        <h4 class="inline">Budget:</h4>$23,000,000
        <span class="attribute">(estimated)</span>
    </div>

    <div class="txt-block">
        <h4 class="inline">Opening Weekend USA:</h4> $260,382,
        <span class="attribute">20 December 2013</span>, <span class="attribute">Limited Release</span>
    </div>

    <div class="txt-block">
        <h4 class="inline">Gross USA:</h4> $25,568,251
    </div>

    <div class="txt-block">
        <h4 class="inline">Cumulative Worldwide Gross:</h4> $47,351,251
    </div>

This is what I tried:

    String url = "https://www.imdb.com/title/tt1798709";
    Connection connection = Jsoup.connect(url);
    Document document = connection.get();
    Elements element = document.getElementsByClass("txt-block");


    String gross = "";
    String budget = "";

    String budgetRegex = "Budget:.*";
    String grossRegex = "Cumulative Worldwide Gross:.*";

    for (Element e : element) {
        if (e.text().matches(budgetRegex)) {
            String text = e.text();
            budget = StringUtils.substringBetween(text, "$", " ");
            break;
        } else {
            budget = null;
        }
    }
    for (Element e : element) {
        if (e.text().matches(grossRegex)) {
            String text = e.text();
            gross = StringUtils.substringAfter(text, "$");
            break;
        } else {
            gross = null;
        }
    }
    System.out.println(gross + ", " + budget);

It works, but is there a better solution?

Answer 1

Use ownText() instead of substring operations, and loop once rather than twice. Try this:

    String url = "https://www.imdb.com/title/tt1798709";
    Connection connection = Jsoup.connect(url);
    Document document = connection.get();
    Elements elements = document.select("div.txt-block");

    String gross = "";
    String budget = "";

    final String budgetRegex = "Budget:";
    final String grossRegex = "Cumulative Worldwide Gross:";

    for (Element e : elements) {
        final Element h4 = e.getElementsByTag("h4").first();
        if (h4 == null) {
            continue; // skip txt-block divs that have no <h4> label
        }
        switch (h4.text()) {
            case budgetRegex:
                budget = e.ownText(); // text directly inside the div, excluding child elements
                break;
            case grossRegex:
                gross = e.ownText();
                break;
        }
        if (!gross.isEmpty() && !budget.isEmpty()) { // optional: stop early once both values are found
            break;
        }
    }
    System.out.println(gross + ", " + budget);
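
Note that ownText() returns all of the text sitting directly inside the div, so a block such as "Opening Weekend USA" above can come back with the stray commas that separate its spans. If a clean amount or a numeric value is needed, something along these lines may help. This is only a sketch using java.util.regex from the standard library; the class and method names (DollarAmounts, firstAmount, toLong) are made up for illustration and are not part of jsoup.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: isolates a dollar amount from text returned by ownText() or text().
final class DollarAmounts {
    // A dollar sign, then digits and commas, ending on a digit (so a trailing comma is not captured).
    private static final Pattern AMOUNT = Pattern.compile("\\$[\\d,]*\\d");

    // Returns the first dollar amount found in the text, e.g. "$23,000,000", if there is one.
    static Optional<String> firstAmount(String text) {
        Matcher matcher = AMOUNT.matcher(text);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    // Strips the dollar sign and the thousands separators and parses the rest as a long.
    static long toLong(String amount) {
        return Long.parseLong(amount.replaceAll("[$,]", ""));
    }

    public static void main(String[] args) {
        String ownText = "$260,382, ,"; // assumed shape of a value with leftover separators
        firstAmount(ownText).ifPresent(amount ->
                System.out.println(amount + " -> " + toLong(amount)));
    }
}
```

For the Budget and Cumulative Worldwide Gross blocks shown in the question, ownText() should already be clean, so this step is only needed when the number itself is wanted or when the block contains extra separators.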

Answer 2

You can use jsoup pseudo-selectors to do the job:

    Document document = Jsoup.parse(html);
    String budget = document.select("div:contains(Budget:)").first().ownText();
    String gross = document.select("div:contains(Cumulative Worldwide Gross:)").first().ownText();
    System.out.println(gross + ", " + budget);

More about pseudo-selectors: https://jsoup.org/cookbook/extracting-data/selector-syntax
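
One caveat when running this against a full page rather than the fragment from the question: :contains(...) also matches every ancestor div whose descendants contain the text, and first() returns the outermost one, whose ownText() is likely to be empty. A possible way around that, sketched here under the assumption that the values always sit in div.txt-block elements labelled by an h4, is to scope the selector with :has and :containsOwn. The class name TxtBlockLookup, the method valueFor, and the wrapping div in the sample HTML are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper: looks up the value of a labelled txt-block. Requires jsoup on the classpath.
final class TxtBlockLookup {
    // Returns the text directly inside the txt-block whose <h4> contains the label, or null if none matches.
    // The label is pasted into the selector as-is, so it must not contain parentheses.
    static String valueFor(Document document, String label) {
        Element block = document
                .select("div.txt-block:has(h4:containsOwn(" + label + "))")
                .first();
        return block == null ? null : block.ownText();
    }

    public static void main(String[] args) {
        // Sample markup modelled on the question, wrapped in an outer div (an assumption about the real page).
        String html = "<div id=\"details\">"
                + "<div class=\"txt-block\"><h4 class=\"inline\">Budget:</h4>$23,000,000 "
                + "<span class=\"attribute\">(estimated)</span></div>"
                + "<div class=\"txt-block\"><h4 class=\"inline\">Cumulative Worldwide Gross:</h4>"
                + " $47,351,251</div>"
                + "</div>";
        Document document = Jsoup.parse(html);
        String budget = valueFor(document, "Budget:");
        String gross = valueFor(document, "Cumulative Worldwide Gross:");
        System.out.println(gross + ", " + budget);
    }
}
```

Returning null for a missing label keeps the lookup from throwing a NullPointerException when the page layout changes; callers still need to check for it.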

Answer 1 is dated 2018-07-18 02:21:31 and Answer 2 is dated 2018-07-18 14:23:09.
