How to access the subclass using jsoup
I want to access this webpage: https://www.google.com/trends/explore#q=ice%20cream and extract the data in the center line graph. The HTML is as follows (I have pasted only the part I use):
<div class="center-col">
<div class="comparison-summary-title-line">...</div>
...
<div id="reportContent" class="report-content">
<!-- This tag handles the report titles component -->
...
<div id="report">
<div id="reportMain">
<div class="timeSection">
<div class = "primaryBand timeBand">...</div>
...
<div aria-lable = "one-chart" style = "position: absolute; ...">
<svg ....>
...
<script type="text/javascript">
var chartData = {...}
The data I need is stored in the script element (the last line). My idea is to get the element with class "report-content" first and then select the script inside it. My code is as follows:
String html = "https://www.google.com/trends/explore#q=ice%20cream";
Document doc = Jsoup.connect(html).get();
Elements center = doc.getElementsByClass("center-col");
Elements report = doc.getElementsByClass("report-content");
System.out.println(center);
System.out.println(report);
When I print the "center-col" elements, I get the content of all the child elements except "report-content", and when I print "report-content", the result is only:
<div id="reportContent" Class="report-content"></div>
I also tried this:
Element report = doc.select("div.report-content").first();
but it still does not work. How can I get the data in the script here? I appreciate your help!
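The report on that page is filled in by JavaScript, which jsoup does not execute, so the "report-content" div comes back empty. Try this URL instead: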
https://www.google.com/trends/trendsReport?hl=en&q=${keywords}&tz=${timezone}&content=1
where
${keywords} is an encoded, space-separated list of keywords
${timezone} is an encoded timezone in the Etc/GMT* form
String myKeywords = "ice cream";
String myTimezone = "Etc/GMT+2";
String url = "https://www.google.com/trends/trendsReport?hl=en&q=" + URLEncoder.encode(myKeywords, "UTF-8") + "&tz=" + URLEncoder.encode(myTimezone, "UTF-8") + "&content=1";
Document doc = Jsoup.connect(url).timeout(10000).get();
Element scriptElement = doc.select("div#TIMESERIES_GRAPH_0-time-chart + script").first();
if (scriptElement==null) {
throw new RuntimeException("Unable to locate trends data.");
}
String jsCode = scriptElement.html();
// parse jsCode to extract chartData...
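The last step above (pulling chartData out of jsCode) could be sketched roughly as follows. This is only one possible approach, using plain string scanning and the standard library. The class name ChartDataExtractor, the method extractObjectLiteral and the sample string in main are all hypothetical; the real content of chartData is elided in the question, so the sample is made up for illustration. It assumes the script contains a declaration of the form `var chartData = {...}` and returns the text between the matching braces.

```java
import java.util.Optional;

// Hypothetical helper, not part of jsoup: finds "var <name> = {...}" in a
// JavaScript string and returns the object literal as raw text.
final class ChartDataExtractor {

    static Optional<String> extractObjectLiteral(String jsCode, String varName) {
        int decl = jsCode.indexOf("var " + varName);
        if (decl < 0) {
            return Optional.empty();
        }
        int start = jsCode.indexOf('{', decl);
        if (start < 0) {
            return Optional.empty();
        }
        int depth = 0;
        boolean inString = false;
        char quote = 0;
        for (int i = start; i < jsCode.length(); i++) {
            char c = jsCode.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == quote) {
                    inString = false;
                }
            } else if (c == '"' || c == '\'') {
                inString = true; // braces inside string literals are ignored
                quote = c;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return Optional.of(jsCode.substring(start, i + 1));
                }
            }
        }
        return Optional.empty(); // unbalanced braces
    }

    public static void main(String[] args) {
        // Made-up sample; the real script content is not shown in the question.
        String jsCode = "var chartData = {\"rows\": [{\"label\": \"a}b\", \"v\": 1}]};\nvar other = {};";
        System.out.println(extractObjectLiteral(jsCode, "chartData").orElse("not found"));
    }
}
```

The returned text is a JavaScript object literal, which is not necessarily valid JSON (it may contain things like `new Date(...)`), so it may need further cleanup before being handed to a JSON parser.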
Try getting the same element by id instead; you will get the complete tag.