Web scraping using select in Jsoup

Question

I'm new to web scraping, and so far the most I can do is scrape the title of a page on IMDb.

I am using this at the moment:

String contentText = doc.select("title").first().text();

This produces the string: Thor: The Dark World (2013) - IMDb

If anyone could help me, I am trying to get the title and the year as separate strings:

"Thor: The Dark World" and "2013"

Thanks in advance!

Answer 1

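// Assumes the title has the form "Name (Year) - IMDb".
String docTitle = doc.select("title").first().text();
// trim() drops the space that precedes the opening parenthesis.
String movieName = docTitle.substring(0, docTitle.indexOf("(")).trim();
int movieReleaseDate = Integer.parseInt(docTitle.substring(docTitle.indexOf("(") + 1,
                                                           docTitle.indexOf(")")));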
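
The substring approach above depends on the first "(" in the title being the one that opens the year. As a rough alternative, here is a minimal sketch that uses a regular expression instead of indexOf, assuming the title always looks like "Name (YYYY) - IMDb". The class name TitleParser and the method split are made up for illustration and do not come from Jsoup or from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleParser {
    // Assumed layout: "<name> (<4-digit year>)" followed by anything, e.g. " - IMDb".
    private static final Pattern NAME_AND_YEAR =
            Pattern.compile("^(.*?)\\s*\\((\\d{4})\\)");

    /** Returns {name, year}, or null if the title does not match the assumed layout. */
    static String[] split(String docTitle) {
        Matcher m = NAME_AND_YEAR.matcher(docTitle);
        if (!m.find()) {
            return null;
        }
        // group(1) is the text before the year, group(2) is the four digits.
        return new String[] { m.group(1).trim(), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("Thor: The Dark World (2013) - IMDb");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

Because the pattern insists on four digits inside the parentheses, a title with an earlier parenthesised word should still yield the year, and a title without a year returns null rather than throwing.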

Answer 2

Well, if you look at the source of this page, you will see the following further down in the document:

<h1 class="header">
<span class="itemprop" itemprop="name">Thor: The Dark World</span> 
<span class="nobr">(<a href="/year/2013/?ref_=tt_ov_inf" >2013</a>)</span>    
</h1>

So it would seem you can select those elements directly and get the required text without any further string hacking.
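
To make that concrete, here is a minimal sketch of selecting the two elements with Jsoup, assuming the page still uses exactly the markup quoted above (IMDb's layout may differ today). The class HeaderScraper and the method nameAndYear are hypothetical names used only for this example, and the HTML is inlined so the sketch runs without a network request.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class HeaderScraper {
    /** Returns {name, year}, or null if either element is missing from the page. */
    static String[] nameAndYear(Document doc) {
        // CSS selectors based on the quoted snippet; they are assumptions about the page.
        Element name = doc.select("h1.header span[itemprop=name]").first();
        Element year = doc.select("h1.header span.nobr a").first();
        if (name == null || year == null) {
            return null; // the layout differs from the quoted snippet
        }
        return new String[] { name.text(), year.text() };
    }

    public static void main(String[] args) {
        // Inline copy of the quoted markup, standing in for the fetched page.
        String html = "<h1 class=\"header\">"
                + "<span class=\"itemprop\" itemprop=\"name\">Thor: The Dark World</span> "
                + "<span class=\"nobr\">(<a href=\"/year/2013/?ref_=tt_ov_inf\">2013</a>)</span>"
                + "</h1>";
        String[] parts = nameAndYear(Jsoup.parse(html));
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

With a real page, the Document would come from the same doc object used in the question. The null check matters because first() returns null when a selector matches nothing.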

Answer 1
0 2013-10-17 21:24:35

Answer 2
0 Accepted 2013-10-17 21:46:21
