
Using select for web scraping in Jsoup

I'm new to web scraping, and so far all I can do is scrape the title of a page on IMDb.

I am using this at the moment:

String contentText = doc.select("title").first().text();

Which produces the string: Thor: The Dark World (2013) - IMDb

I am trying to get the title and the year as separate strings, like this:

"Thor: The Dark World" "2013"

Thanks in advance!

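// Full page title, e.g. "Thor: The Dark World (2013) - IMDb"
String docTitle = doc.select("title").first().text();
// Everything before the first "(" is the movie name; trim() drops the trailing space
String movieName = docTitle.substring(0, docTitle.indexOf("(")).trim();
// The text between the first "(" and the first ")" is the release year
int movieReleaseDate = Integer.parseInt(docTitle.substring(docTitle.indexOf("(") + 1,
                                             docTitle.indexOf(")")));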
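The substring approach above relies on the first "(" in the title being the one that opens the year. As a rough sketch of an alternative, a regular expression can require a four-digit year inside the parentheses. The class name TitleSplitter and the method split are hypothetical names made up for this example, and the "Name (YYYY) - IMDb" title format is assumed from the string shown in the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleSplitter {
    // Lazily captures the name, then a four-digit year in parentheses.
    // Anything after the closing ")" (such as " - IMDb") is ignored.
    private static final Pattern TITLE = Pattern.compile("^(.*?)\\s*\\((\\d{4})\\)");

    /** Returns {name, year}, or null if the title contains no "(YYYY)" part. */
    public static String[] split(String docTitle) {
        Matcher m = TITLE.matcher(docTitle);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1).trim(), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("Thor: The Dark World (2013) - IMDb");
        System.out.println(parts[0]); // Thor: The Dark World
        System.out.println(parts[1]); // 2013
    }
}
```

Returning null for a title without a year lets the caller decide what to do, instead of failing with a StringIndexOutOfBoundsException or NumberFormatException.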

If you look at the source of this page, you will see the following further down in the document:

<h1 class="header">
<span class="itemprop" itemprop="name">Thor: The Dark World</span> 
<span class="nobr">(<a href="/year/2013/?ref_=tt_ov_inf" >2013</a>)</span>    
</h1>

So it seems you can select the title and the year directly from these elements, without any string manipulation on the page title.
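A minimal sketch of selecting those two elements with Jsoup, assuming the page still uses the h1.header markup quoted above (IMDb's markup may differ today). The class name HeaderScraper and the method extract are hypothetical, and the HTML is parsed from an inline string here rather than fetched from the site:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HeaderScraper {
    /** Returns {name, year}; an entry is null if its element is not found. */
    public static String[] extract(Document doc) {
        // Title text sits in the span with class "itemprop" inside h1.header
        Element name = doc.select("h1.header span.itemprop").first();
        // Year is the link text inside the span with class "nobr"
        Element year = doc.select("h1.header span.nobr a").first();
        return new String[] {
            name == null ? null : name.text(),
            year == null ? null : year.text()
        };
    }

    public static void main(String[] args) {
        // Same markup as the snippet quoted from the page source
        String html = "<h1 class=\"header\">"
            + "<span class=\"itemprop\" itemprop=\"name\">Thor: The Dark World</span> "
            + "<span class=\"nobr\">(<a href=\"/year/2013/?ref_=tt_ov_inf\">2013</a>)</span>"
            + "</h1>";
        String[] parts = extract(Jsoup.parse(html));
        System.out.println(parts[0]); // Thor: The Dark World
        System.out.println(parts[1]); // 2013
    }
}
```

The null checks matter because first() returns null when a selector matches nothing, which is what would happen if the page layout changed.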
