
Web scraping in IMDb using R

I want to find the links to the top 250 movies on IMDb. I decided to look for a common pattern by viewing the HTML source code. I found "chttp", but I am not sure it will get me anywhere. How can I find a pattern from which to construct the links?

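require("XML")  # loaded here, although readLines() and grep() below are base R
imdb="http://www.imdb.com/chart/top?sort=ir,desc"
imdb.page=readLines(imdb)                     # raw HTML, one element per line
g = grep(pattern = "chttp", x = imdb.page)    # indices of lines containing "chttp"
imdb.lines=imdb.page[g]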

Here's an example output:

> imdb.lines[1]
[1] "      <h3><a href=\"/chart/?ref_=chttp_cht\" >IMDb Charts</a></h3>"

My main problem is figuring out the link (URL) for each of the top 250 movies based on the code I have already written. I basically don't know what the next step is. I am also not sure whether "chttp", the pattern I passed to grep, is a good one at all.

So, according to the results, starting from index 3 the movie titles are on the odd indices:

> imdb.lines[1]
[1] "      <h3><a href=\"/chart/?ref_=chttp_cht\" >IMDb Charts</a></h3>"
> imdb.lines[2]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0111161/?ref_=chttp_tt_1\" ><img src=\"http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_SX34_CR0,0,34,50_.jpg\" width=\"34\" height=\"50\" />"
> imdb.lines[3]
[1] "    <a href=\"/title/tt0111161/?ref_=chttp_tt_1\" title=\"Frank Darabont (dir.), Tim Robbins, Morgan Freeman\" >The Shawshank Redemption</a>"
> imdb.lines[4]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0068646/?ref_=chttp_tt_2\" ><img src=\"http://ia.media-imdb.com/images/M/MV5BMjEyMjcyNDI4MF5BMl5BanBnXkFtZTcwMDA5Mzg3OA@@._V1_SX34_CR0,0,34,50_.jpg\" width=\"34\" height=\"50\" />"
> imdb.lines[5]
[1] "    <a href=\"/title/tt0068646/?ref_=chttp_tt_2\" title=\"Francis Ford Coppola (dir.), Marlon Brando, Al Pacino\" >The Godfather</a>"
> imdb.lines[6]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0071562/?ref_=chttp_tt_3\" ><img src=\"http://ia.media-imdb.com/images/M/MV5BNDc2NTM3MzU1Nl5BMl5BanBnXkFtZTcwMTA5Mzg3OA@@._V1_SX34_CR0,0,34,50_.jpg\" width=\"34\" height=\"50\" />"
> imdb.lines[7]
[1] "    <a href=\"/title/tt0071562/?ref_=chttp_tt_3\" title=\"Francis Ford Coppola (dir.), Al Pacino, Robert De Niro\" >The Godfather: Part II</a>"
> imdb.lines[9]
[1] "    <a href=\"/title/tt0468569/?ref_=chttp_tt_4\" title=\"Christopher Nolan (dir.), Christian Bale, Heath Ledger\" >The Dark Knight</a>"
> imdb.lines[10]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0110912/?ref_=chttp_tt_5\" ><img src=\"http://ia.media-imdb.com/images/M/MV5BMjE0ODk2NjczOV5BMl5BanBnXkFtZTYwNDQ0NDg4._V1_SY50_CR0,0,34,50_.jpg\" width=\"34\" height=\"50\" />"
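Given lines shaped like the output above, one possible next step is to pull the /title/tt.../ path out of each line with a regular expression and prepend the site root. The following is only a minimal base R sketch, assuming the href format shown in the output stays as it is; the sample vector holds four lines copied from that output, and the names pattern, paths and urls are illustrative.

```r
# A few lines copied from the grep output above, standing in for imdb.lines
imdb.lines <- c(
  "      <h3><a href=\"/chart/?ref_=chttp_cht\" >IMDb Charts</a></h3>",
  "    <a href=\"/title/tt0111161/?ref_=chttp_tt_1\" title=\"Frank Darabont (dir.), Tim Robbins, Morgan Freeman\" >The Shawshank Redemption</a>",
  "  <td class=\"posterColumn\"><a href=\"/title/tt0068646/?ref_=chttp_tt_2\" ><img src=\"http://ia.media-imdb.com/images/M/MV5BMjEyMjcyNDI4MF5BMl5BanBnXkFtZTcwMDA5Mzg3OA@@._V1_SX34_CR0,0,34,50_.jpg\" width=\"34\" height=\"50\" />",
  "    <a href=\"/title/tt0068646/?ref_=chttp_tt_2\" title=\"Francis Ford Coppola (dir.), Marlon Brando, Al Pacino\" >The Godfather</a>"
)

# Assumed pattern: a title path is "/title/tt" followed by digits and a slash
pattern <- "/title/tt[0-9]+/"

# Keep only lines that contain a title path (drops the "IMDb Charts" heading)
title.lines <- imdb.lines[grepl(pattern, imdb.lines)]

# Extract the first matching path from each remaining line
paths <- regmatches(title.lines, regexpr(pattern, title.lines))

# Poster and title rows repeat the same path, so de-duplicate before building URLs
urls <- unique(paste0("http://www.imdb.com", paths))
print(urls)
```

Matching on the /title/ path rather than on "chttp" avoids relying on odd or even positions in the grep result, since each movie's poster row and title row collapse into one URL.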

XPath makes jobs like this trivial.

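library(XML)
# Parse the chart page into an HTML document tree
tt <- htmlParse('http://www.imdb.com/chart/top?sort=ir,desc')
# Column 1: link text (the movie title); remaining columns: each anchor's attributes
cbind(xpathSApply(tt, "//td[@class='titleColumn']//a", xmlValue),
      t(xpathSApply(tt, "//td[@class='titleColumn']//a", xmlAttrs)))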

The first argument to cbind returns the titles (the text between the a tags), and the second returns the anchors' attributes (href and title; the latter in this case contains details about each film's director and lead actors).
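To turn those href attributes into full links, the same XPath query can be combined with xmlGetAttr. The sketch below is illustrative only: it parses a small hand-written HTML fragment that imitates the chart markup (an assumption, so that it runs without network access), and the names html, anchors and movies are hypothetical.

```r
library(XML)

# Hand-written fragment imitating the chart's titleColumn cells (assumed markup)
html <- '<table>
  <tr><td class="titleColumn"><a href="/title/tt0111161/?ref_=chttp_tt_1" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt0068646/?ref_=chttp_tt_2" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a></td></tr>
</table>'

# asText = TRUE tells htmlParse the argument is HTML content, not a file or URL
doc <- htmlParse(html, asText = TRUE)
anchors <- "//td[@class='titleColumn']//a"

# Relative links, with the "?ref_=..." tracking query removed
href <- xpathSApply(doc, anchors, xmlGetAttr, "href")
href <- sub("\\?.*$", "", href)

movies <- data.frame(
  title = xpathSApply(doc, anchors, xmlValue),
  url   = paste0("http://www.imdb.com", href),
  stringsAsFactors = FALSE
)
print(movies)
```

Against the live page, htmlParse would be given the chart URL as in the answer above, assuming the page still uses the titleColumn class.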

What about using the alternative interfaces?

Edit #1: I have looked into some of the files, and they don't seem to contain any links or even the IMDb ID; there should be another way, though.

Edit #2: OK, apparently there is no other way, but somebody has already done something along these lines, e.g. this guy; have a look.
