
Scraping Multiple Pages Scrapy

I am trying to scrape Billboard's Hot 100 for every year. I have a file that works for one year at a time, but I want it to crawl through all the years and collect that data. Here is my current code:

from scrapy import Spider
from scrapy.selector import Selector
from Billboard.items import BillboardItem
from scrapy.exceptions import CloseSpider
from scrapy.http import Request

URL = "http://www.billboard.com/archive/charts/%/hot-100"

class BillboardSpider(Spider):
    name = 'Billboard_spider'
    allowed_urls = ['http://www.billboard.com/']
    start_urls = [URL % 1958]

    def _init_(self):
        self.page_number = 1958

    def parse(self, response):
        print self.page_number
        print "----------"

        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()

        for row in rows:
            IssueDate = Selector(text=row).xpath('//td[1]/a/span/text()').extract()
            Song = Selector(text=row).xpath('//td[2]/text()').extract()
            Artist = Selector(text=row).xpath('//td[3]/a/text()').extract()

            item = BillboardItem()
            item['IssueDate'] = IssueDate
            item['Song'] = Song
            item['Artist'] = Artist

            yield item
        self.page_number += 1
        yield Request(URL % self.page_number)

But I get the error: "start_urls = [URL % 1958] ValueError: unsupported format character '/' (0x2f) at index 41"

Any ideas? I want the code to automatically change the year in the original "URL" link from 1958 to 1959, and so on year by year, until it stops finding tables, and then close.

The error you are getting is because you are not using the correct syntax for string formatting. You can see the details of how it works here. The reason it doesn't work in your particular case is that your URL is missing an "s":

URL = "http://www.billboard.com/archive/charts/%/hot-100"

should be

URL = "http://www.billboard.com/archive/charts/%s/hot-100"

In any case, it is better to use new-style string formatting:

URL = "http://www.billboard.com/archive/charts/{}/hot-100"
start_urls = [URL.format(1958)]
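Side by side, the two placeholder styles produce the same URL; a quick standalone sketch:

```python
# Old-style %-formatting: %s marks where the year is substituted.
url_old = "http://www.billboard.com/archive/charts/%s/hot-100" % 1958

# New-style str.format(): {} is the placeholder.
url_new = "http://www.billboard.com/archive/charts/{}/hot-100".format(1958)

print(url_old)
print(url_new)
```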

Moving on, your code has some other problems:

def _init_(self):
    self.page_number=1958

If you want to use an init function, it should be named __init__ (two underscores on each side), and since you are extending Spider, you need to pass *args and **kwargs so you can call the parent constructor:

def __init__(self, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    self.page_number = 1958

That said, it sounds like you might be better off not using __init__ at all, and instead just generating all the URLs from the start with a list comprehension:

start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year) 
                  for year in range(1958, 2018)]

start_urls will then look like this:

['http://www.billboard.com/archive/charts/1958/hot-100',
 'http://www.billboard.com/archive/charts/1959/hot-100',
 'http://www.billboard.com/archive/charts/1960/hot-100',
 'http://www.billboard.com/archive/charts/1961/hot-100',
 ...
 'http://www.billboard.com/archive/charts/2017/hot-100']
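One thing to watch: range's upper bound is exclusive, so reaching the 2017 URL requires range(1958, 2018). A standalone check of the comprehension's endpoints:

```python
# Rebuild the start_urls list and verify its boundaries.
start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year)
              for year in range(1958, 2018)]  # upper bound is exclusive

print(len(start_urls))
print(start_urls[0])
print(start_urls[-1])
```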

You also haven't populated the BillboardItem correctly, since objects (by default) don't support item assignment:

 item = BillboardItem()
 item['IssueDate'] = IssueDate
 item['Song'] = Song
 item['Artist'] = Artist

should be:

item = BillboardItem()
item.IssueDate = IssueDate
item.Song = Song
item.Artist = Artist

although it is usually better to do this in the class's init function:

class BillboardItem(object):
    def __init__(self, issue_date, song, artist):
        self.issue_date = issue_date
        self.song = song
        self.artist = artist

and then create the item via item = BillboardItem(IssueDate, Song, Artist).

Update

In any case, I cleaned up your code (and created the BillboardItem, since I don't know exactly what yours looks like):

from scrapy import Spider, Item, Field
from scrapy.selector import Selector
from scrapy.exceptions import CloseSpider
from scrapy.http import Request


class BillboardItem(Item):
    issue_date = Field()
    song = Field()
    artist = Field()


class BillboardSpider(Spider):
    name = 'billboard'
    allowed_urls = ['http://www.billboard.com/']
    start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year)
                  for year in range(1958, 2018)]


    def parse(self, response):
        print(response.url)
        print("----------")

        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()

        for row in rows:
            issue_date = Selector(text=row).xpath('//td[1]/a/span/text()').extract()
            song = Selector(text=row).xpath('//td[2]/text()').extract()
            artist = Selector(text=row).xpath('//td[3]/a/text()').extract()

            item = BillboardItem(issue_date=issue_date, song=song, artist=artist)

            yield item
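The question also asked that the spider shut down once it stops finding chart tables; the code above imports nothing for that, but Scrapy's CloseSpider exception fits. A minimal sketch of the stopping condition (the CloseSpider class below is a stand-in so the snippet runs without Scrapy; in the spider you would use scrapy.exceptions.CloseSpider):

```python
class CloseSpider(Exception):
    """Stand-in for scrapy.exceptions.CloseSpider so this sketch runs without Scrapy."""

def rows_or_stop(rows):
    """Return the scraped rows, or raise CloseSpider when a page has none,
    meaning we have walked past the last archived chart year."""
    if not rows:
        raise CloseSpider('no more chart tables found')
    return rows

# Inside parse() this would wrap the XPath result:
#     rows = rows_or_stop(response.xpath('//*[@id="block-system-main"]'
#                                        '/div/div/div[2]/table/tbody/tr').extract())
```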

Hope this helps. :)
