
IMDB Scraper Using Python and Scrapy

Alright, I am new to programming, and I figured the best way to learn would be to program something. Part of my job involves searching for a movie on IMDB and pasting the director, writer, (first four) actors, and a link to the IMDB page in an Excel spreadsheet.

My end goal is to have a CSV with the film title and year, and have the scraper take these variables from the CSV, search IMDB, pull the data, and export the data into a new CSV.


I have been reading and researching for about a week. I have gone through the Scrapy tutorial successfully, but I'm having trouble getting from there to the desired end result.

  • How can I import values from a CSV into my spider script? I am thinking it would look something like this:

     name = COLUMN1
     year = COLUMN2

     class imdb_spider(scrapy.Spider):
         name = "imdb"
         allowed_domains = ["imdb.com"]
         start_urls = [
             "http://www.imdb.com/find?ref_=nv_sr_fn&q=/(name)&(year)"
         ]

I am not sure how to pull from a CSV file though.


  • From there, I would need the spider to follow the first link on the page (which would be the film name), and then the "see full cast and crew" link on the subsequent page.

All the information I need would be on this last page: http://www.imdb.com/title/tt0081505/fullcredits?ref_=tt_ov_st_sm


  • Defining what to extract is really puzzling to me.

Here is what I pulled using Firebug:

Director:

<td class="name">
<a href="/name/nm0000040/?ref_=ttfc_fc_dr1"> Stanley Kubrick </a>
</td>

Writer:

<td class="name">
<a href="/name/nm0000040/?ref_=ttfc_fc_wr2"> Stanley Kubrick </a>
</td>

Actors (only need first four, if possible):

<td class="itemprop" itemtype="http://schema.org/Person" itemscope="" itemprop="actor">
<td class="ellipsis"> ... </td>

I am not sure how to define the page link itself.


After that, I just need to loop it over the whole list and save a new CSV with the data.

I know this is an intense question, and I'm not asking anyone to code it for me. I'm willing to put in the work if I know where to look/how to figure this out. I am reading through the Scrapy documentation, but it is still unclear.

If there is an obviously better way to do this than Python and Scrapy, let me know.

Thanks.

Edit: Mac OS X 10.10.1, Python 2.7, Scrapy 0.24.4, TextWrangler as the editor

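The csv module is quite handy for reading the input, and it also works for tab-separated files that have irregular or empty fields. On Python 2.7, open the file in binary mode ('rb'):

    import csv

    with open('something_something_darkside.txt', 'rb') as f:
        data = list(csv.reader(f, delimiter='\t'))
        for row in data:
            print(row)  # each row is a list of that line's fields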
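
To connect the csv reading above to the search step in the question, here is a minimal sketch that turns each (title, year) row into a search URL. It is written for Python 3, where csv reads text rather than bytes. The sample data, the `search_urls` helper, and the idea of putting title and year together in the `q` parameter are assumptions for illustration; only the base URL and the `ref_` value come from the question.

```python
import csv
import io
from urllib.parse import urlencode

# Hypothetical input: one film per line, title and year separated by a tab.
SAMPLE = "The Shining\t1980\nBarry Lyndon\t1975\n"


def search_urls(lines):
    """Yield one search URL per (title, year) row (hypothetical helper)."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or irregular rows
        title, year = row[0].strip(), row[1].strip()
        # Assumption: the search page accepts "title year" in the q parameter.
        query = urlencode({"ref_": "nv_sr_fn", "q": "%s %s" % (title, year)})
        yield "http://www.imdb.com/find?" + query


# io.StringIO stands in for an open file so the sketch runs on its own.
urls = list(search_urls(io.StringIO(SAMPLE)))
for url in urls:
    print(url)
```

In a spider, a list like `urls` could be assigned to `start_urls`, or the spider's `start_requests` method could yield one request per URL, so that each film's title and year stay attached to its request.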

As for the web pages, I have used Beautiful Soup to turn the HTML into XML and then an XML parser to extract what I needed. These methods may be outdated, but they are still reliable.
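
As a rough sketch of the extraction step, the following pulls the director, writer, and first four actors out of markup shaped like the snippets in the question. It uses the standard library's `html.parser` instead of Beautiful Soup, purely to keep the example free of dependencies, and it targets Python 3. The `CreditsParser` class is hypothetical, the actor rows are placeholders, and the assumption that directors and writers can be told apart by the `ref_=ttfc_fc_dr` and `ref_=ttfc_fc_wr` markers in the link is inferred only from the two snippets quoted above; the real page may be structured differently.

```python
from html.parser import HTMLParser

# Markup modelled on the question's snippets. The actor rows are placeholders.
SAMPLE = """
<table>
<tr><td class="name"><a href="/name/nm0000040/?ref_=ttfc_fc_dr1"> Stanley Kubrick </a></td></tr>
<tr><td class="name"><a href="/name/nm0000040/?ref_=ttfc_fc_wr2"> Stanley Kubrick </a></td></tr>
<tr><td class="itemprop" itemprop="actor"><a href="/name/example1/"> Actor One </a></td></tr>
<tr><td class="itemprop" itemprop="actor"><a href="/name/example2/"> Actor Two </a></td></tr>
<tr><td class="itemprop" itemprop="actor"><a href="/name/example3/"> Actor Three </a></td></tr>
<tr><td class="itemprop" itemprop="actor"><a href="/name/example4/"> Actor Four </a></td></tr>
<tr><td class="itemprop" itemprop="actor"><a href="/name/example5/"> Actor Five </a></td></tr>
</table>
"""


class CreditsParser(HTMLParser):
    """Collect link text by role (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.found = {"director": [], "writer": [], "actor": []}
        self._in_actor_cell = False
        self._role = None  # role of the <a> currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self._in_actor_cell = attrs.get("itemprop") == "actor"
        elif tag == "a":
            href = attrs.get("href") or ""
            if "ref_=ttfc_fc_dr" in href:
                self._role = "director"
            elif "ref_=ttfc_fc_wr" in href:
                self._role = "writer"
            elif self._in_actor_cell:
                self._role = "actor"

    def handle_endtag(self, tag):
        if tag == "a":
            self._role = None
        elif tag == "td":
            self._in_actor_cell = False

    def handle_data(self, data):
        name = data.strip()
        if self._role and name:
            self.found[self._role].append(name)


parser = CreditsParser()
parser.feed(SAMPLE)
found = parser.found
top_actors = found["actor"][:4]  # only the first four are needed
print(found["director"], found["writer"], top_actors)
```

Inside a Scrapy callback, the link to the page being parsed is available as `response.url`, which may be enough for the "page link" column.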
