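How to get data from page source using Java and jsoup
How can I get the values $23,000,000 and $47,351,251 from the page source below? I only want to extract these values from the source, but I'm not sure of the best way to do it.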
<div class="txt-block">
<h4 class="inline">Budget:</h4>$23,000,000
<span class="attribute">(estimated)</span>
</div>
<div class="txt-block">
<h4 class="inline">Opening Weekend USA:</h4> $260,382,
<span class="attribute">20 December 2013</span>, <span class="attribute">Limited Release</span>
</div>
<div class="txt-block">
<h4 class="inline">Gross USA:</h4> $25,568,251
</div>
<div class="txt-block">
<h4 class="inline">Cumulative Worldwide Gross:</h4> $47,351,251
</div>
I tried it like this:
String url = "https://www.imdb.com/title/tt1798709";
Connection connection = Jsoup.connect(url);
Document document = connection.get();
Elements element = document.getElementsByClass("txt-block");
String gross = "";
String budget = "";
String budgetRegex = "Budget:.*";
String grossRegex = "Cumulative Worldwide Gross:.*";
for (Element e : element) {
    if (e.text().matches(budgetRegex)) {
        String text = e.text();
        budget = StringUtils.substringBetween(text, "$", " ");
        break;
    } else {
        budget = null;
    }
}
for (Element e : element) {
    if (e.text().matches(grossRegex)) {
        String text = e.text();
        gross = StringUtils.substringAfter(text, "$");
        break;
    } else {
        gross = null;
    }
}
System.out.println(gross + ", " + budget);
It works, but is there a better solution?
Use ownText() instead of substring operations, and loop only once instead of twice. Try this:
String url = "https://www.imdb.com/title/tt1798709";
Connection connection = Jsoup.connect(url);
Document document = connection.get();
Elements elements = document.select("div.txt-block");
String gross = "";
String budget = "";
final String budgetRegex = "Budget:";
final String grossRegex = "Cumulative Worldwide Gross:";
for (Element e : elements) {
    final Element h4 = e.getElementsByTag("h4").first();
    if (h4 == null) {
        continue; // skip blocks without an <h4> heading to avoid a NullPointerException
    }
    final String h4Text = h4.text();
    switch (h4Text) {
        case budgetRegex:
            budget = e.ownText();
            break;
        case grossRegex:
            gross = e.ownText();
            break;
    }
    if (!gross.isEmpty() && !budget.isEmpty()) { // this check is optional, just added for performance
        break;
    }
}
System.out.println(gross + ", " + budget);
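If the values are needed as numbers rather than strings, the text returned by ownText() can be cleaned up afterwards. The following is only a minimal sketch using the standard library: the class name MoneyText and the method parseDollars are hypothetical, and it assumes whole-dollar amounts formatted like the ones in the question (a dollar sign followed by comma-separated digit groups).

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: turns text such as "$23,000,000" into a long.
final class MoneyText {
    // A dollar sign followed by digits and commas. The match must end on a digit,
    // so the trailing comma in "$260,382," is not captured.
    private static final Pattern DOLLARS = Pattern.compile("\\$\\s*(\\d[\\d,]*\\d|\\d)");

    // Returns the first dollar amount found in the text, or empty if there is none.
    static OptionalLong parseDollars(String text) {
        Matcher m = DOLLARS.matcher(text);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        // Drop the thousands separators before parsing.
        return OptionalLong.of(Long.parseLong(m.group(1).replace(",", "")));
    }

    public static void main(String[] args) {
        System.out.println(parseDollars("$23,000,000"));
        System.out.println(parseDollars("$260,382, ,"));
        System.out.println(parseDollars("n/a"));
    }
}
```

Returning an OptionalLong instead of null or 0 makes the "value not found on the page" case explicit to the caller, for example parseDollars(budget).orElse(-1).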
You can use jsoup pseudo-selectors to do this:
Document document = Jsoup.parse(html);
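// :contains also matches ancestor elements, so restrict the match to the txt-block divs
String budget = document.select("div.txt-block:contains(Budget:)").first().ownText();
String gross = document.select("div.txt-block:contains(Cumulative Worldwide Gross:)").first().ownText();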
System.out.println(gross + ", " + budget);
For more on pseudo-selectors, see: https://jsoup.org/cookbook/extracting-data/selector-syntax