Data crawling using the scrapy package in Python

Question

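I am trying to get image data from a website (IMDB) using the scrapy package.
If the div class contains an image URL, I can scrape the data along with the movie poster. If it does not, my code does not work: it skips some of the data associated with the image.
I want to fix it so that when there is no image URL, it forgets about the image and just scrapes the rest of the data.
How can I fix the except part?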

def parse(self, response):

    # some other lines

    try:
        poster_image_url = response.xpath('//div[@class="poster"]/a/img/@src').extract()[0]
        poster_image_url = [poster_image_url.split("_V1_")[0] + "_V1_.jpg"]
    except:
        poster_image_url = None
    item['image_urls'] = poster_image_url

Here is the pipeline code:

class ImdbPipeline(object):

    def process_item(self, item, spider):
        return item

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

Answer 1

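You can use extract_first() together with an if check. extract_first() returns None when nothing matches, so no try/except is needed: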

poster_image_url = response.xpath('//div[@class="poster"]/a/img/@src').extract_first()
if poster_image_url:
    # image_urls must be a list, because the pipeline iterates over it
    item['image_urls'] = [poster_image_url.split('_V1')[0] + '_V1_.jpg']
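
The same check can be pulled into small helpers so that both the spider and the pipeline tolerate a missing poster. This is only a sketch: the names build_image_urls and iter_image_urls are hypothetical and do not come from the original post or from Scrapy.

```python
def build_image_urls(poster_src):
    """Return a list holding the full-size poster URL, or [] when there is no poster."""
    if not poster_src:
        # extract_first() returned None (or an empty string): no image to download
        return []
    return [poster_src.split("_V1")[0] + "_V1_.jpg"]


def iter_image_urls(item):
    """Yield image URLs from an item, tolerating a missing or None field."""
    for image_url in item.get("image_urls") or []:
        yield image_url
```

In parse, the result of extract_first() could be passed straight to build_image_urls and assigned to item['image_urls']. In get_media_requests, looping over iter_image_urls(item) instead of item['image_urls'] avoids an error when the field is None or was never set.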

Alternatively, you can use Scrapy's ItemLoader.

Solution 1
0 2017-04-25 11:03:45
