
Scraping Multiple Pages Scrapy

I'm trying to scrape every year of the Billboard Hot 100 archive. I have a spider that works for one year at a time, but I want it to crawl through all years and gather that data as well. Here is my current code:

from scrapy import Spider
from scrapy.selector import Selector
from Billboard.items import BillboardItem
from scrapy.exceptions import CloseSpider
from scrapy.http import Request

URL = "http://www.billboard.com/archive/charts/%/hot-100"

class BillboardSpider(Spider):
    name = 'Billboard_spider'
    allowed_urls = ['http://www.billboard.com/']
    start_urls = [URL % 1958]

    def _init_(self):
        self.page_number=1958

    def parse(self, response):
        print self.page_number
        print "----------"

        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()

        for row in rows:
            IssueDate = Selector(text=row).xpath('//td[1]/a/span/text()').extract()
            Song = Selector(text=row).xpath('//td[2]/text()').extract()
            Artist = Selector(text=row).xpath('//td[3]/a/text()').extract()

            item = BillboardItem()
            item['IssueDate'] = IssueDate
            item['Song'] = Song
            item['Artist'] = Artist

            yield item

        self.page_number += 1
        yield Request(URL % self.page_number)

But I'm getting this error: "start_urls = [URL % 1958] ValueError: unsupported format character '/' (0x2f) at index 41"

Any ideas? I want the code to change the year in the original "URL" link to 1959 automatically, keep going year by year until it stops finding the table, and then close out.

The error you're getting is because the string formatting syntax is incorrect: a % in a format string must be followed by a conversion type such as s or d (see the Python documentation on printf-style string formatting for details). It doesn't work in your particular case because your URL is missing an 's':

URL = "http://www.billboard.com/archive/charts/%/hot-100"

should be

URL = "http://www.billboard.com/archive/charts/%s/hot-100"

Anyway, it's better to use new-style string formatting:

URL = "http://www.billboard.com/archive/charts/{}/hot-100"
start_urls = [URL.format(1958)]
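
As a quick check of the explanation above, the following minimal sketch reproduces the error and compares both fixes. The constant names (BROKEN_URL, PERCENT_URL, BRACE_URL) are made up for illustration; only the URL strings come from the question.

```python
# Illustrative names; only the URL strings are taken from the question.
BROKEN_URL = "http://www.billboard.com/archive/charts/%/hot-100"
PERCENT_URL = "http://www.billboard.com/archive/charts/%s/hot-100"
BRACE_URL = "http://www.billboard.com/archive/charts/{}/hot-100"

# "%/" is not a valid conversion specifier, so the % operator raises ValueError.
try:
    BROKEN_URL % 1958
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Both corrected forms build the same URL.
old_style = PERCENT_URL % 1958
new_style = BRACE_URL.format(1958)

print(error_message)
print(old_style == new_style)
```

Either corrected form works; the {} version simply avoids the % operator altogether.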

Moving on, your code has some other problems:

def _init_(self):
    self.page_number=1958

If you want to use an init function, it should be named __init__ (two underscores on each side), and because you're extending Spider, you need to accept *args and **kwargs so you can call the parent constructor:

def __init__(self, *args, **kwargs):
    super(BillboardSpider, self).__init__(*args, **kwargs)
    self.page_number = 1958
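
To illustrate the difference the underscores make, here is a small sketch that runs without Scrapy installed. BaseSpider is a hypothetical stand-in for scrapy.Spider, and MisnamedInit and ProperInit are made-up class names; none of them come from the original code.

```python
class BaseSpider(object):
    """Hypothetical stand-in for scrapy.Spider so the sketch runs without Scrapy."""

    def __init__(self, name=None, **kwargs):
        self.name = name


class MisnamedInit(BaseSpider):
    def _init_(self):
        # Single underscores: an ordinary method that Python never calls automatically.
        self.page_number = 1958


class ProperInit(BaseSpider):
    def __init__(self, *args, **kwargs):
        # Forward the arguments so the parent constructor still runs.
        super(ProperInit, self).__init__(*args, **kwargs)
        self.page_number = 1958


misnamed = MisnamedInit(name="billboard")
proper = ProperInit(name="billboard")

print(hasattr(misnamed, "page_number"))  # the misnamed method never ran
print(proper.page_number, proper.name)
```

With the misnamed method, self.page_number is never set, so the first access inside parse would raise an AttributeError.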

That said, it sounds like you might be better off not using __init__ at all and instead using a list comprehension to generate all the URLs from the get-go:

start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year) 
                  for year in range(1958, 2017)]

start_urls will then look like this:

['http://www.billboard.com/archive/charts/1958/hot-100',
 'http://www.billboard.com/archive/charts/1959/hot-100',
 'http://www.billboard.com/archive/charts/1960/hot-100',
 'http://www.billboard.com/archive/charts/1961/hot-100',
 ...
 'http://www.billboard.com/archive/charts/2016/hot-100']
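
One detail worth checking is that range() excludes its end value, so range(1958, 2017) stops at 2016. The sketch below compares the two spellings; URL_TEMPLATE and build_start_urls are hypothetical names introduced only for this illustration.

```python
# Hypothetical helper names; the URL pattern is the one used above.
URL_TEMPLATE = "http://www.billboard.com/archive/charts/{year}/hot-100"


def build_start_urls(first_year, last_year):
    """Return one archive URL per year, including last_year itself."""
    return [URL_TEMPLATE.format(year=year)
            for year in range(first_year, last_year + 1)]


# range(1958, 2017) covers 1958 through 2016 only.
exclusive = [URL_TEMPLATE.format(year=year) for year in range(1958, 2017)]
# The helper adds 1 to the end so that 2017 is included.
inclusive = build_start_urls(1958, 2017)

print(len(exclusive), exclusive[-1])
print(len(inclusive), inclusive[-1])
```

If the final year should be part of the crawl, the end of the range needs to be one past it.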

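You may also not be populating your BillboardItem correctly. If BillboardItem is a plain Python class, the following fails, because plain objects don't (by default) support item assignment. (If it subclasses scrapy.Item and declares these names with Field(), the dict-style assignment is already correct and should be left alone, since scrapy.Item rejects attribute assignment for fields.)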

 item = BillboardItem()
 item['IssueDate'] = IssueDate
 item['Song'] = Song
 item['Artist'] = Artist

For a plain class it should be:

item = BillboardItem()
item.IssueDate = IssueDate
item.Song = Song
item.Artist = Artist

although it's generally better to do that in the class's __init__ method:

class BillboardItem(object):
    def __init__(self, issue_date, song, artist):
        self.issue_date = issue_date
        self.song = song
        self.artist = artist

and then create the item with:

item = BillboardItem(IssueDate, Song, Artist)
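
To see why dict-style assignment fails on a plain class, here is a short sketch. PlainItem is a hypothetical name and the field values are placeholders, not real chart data.

```python
class PlainItem(object):
    """A plain class: attribute access works, item['key'] = value does not."""

    def __init__(self, issue_date, song, artist):
        self.issue_date = issue_date
        self.song = song
        self.artist = artist


# Placeholder values, not real chart data.
item = PlainItem("example date", "example song", "example artist")

# Plain objects define no __setitem__, so dict-style assignment raises TypeError.
try:
    item["song"] = "another song"
    assignment_error = None
except TypeError as exc:
    assignment_error = exc

print(type(assignment_error).__name__)
print(item.song)  # unchanged, because the assignment never happened
```

A scrapy.Item subclass behaves the other way round: it accepts item['song'] = ... for declared fields and refuses item.song = ....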

Updated

Anyway, I cleaned up your code (and created a BillboardItem, as I don't know exactly what yours looks like):

from scrapy import Spider, Item, Field


class BillboardItem(Item):
    issue_date = Field()
    song = Field()
    artist = Field()


class BillboardSpider(Spider):
    name = 'billboard'
    # Scrapy's offsite filter reads allowed_domains (bare domain names, not URLs).
    allowed_domains = ['billboard.com']
    # range() excludes its end value, so this covers 1958 through 2016.
    start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year)
                  for year in range(1958, 2017)]

    def parse(self, response):
        # Log which year's page is being parsed.
        print(response.url)
        print("----------")

        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr')

        for row in rows:
            # Relative XPaths are evaluated against the current <tr> only.
            yield BillboardItem(
                issue_date=row.xpath('./td[1]/a/span/text()').extract(),
                song=row.xpath('./td[2]/text()').extract(),
                artist=row.xpath('./td[3]/a/text()').extract(),
            )
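
The question also asked for the crawl to advance year by year until the table is no longer found. The sketch below shows only that control flow in plain Python, so it runs without Scrapy. The names next_year_url and crawl_years and the fetch_rows callback are hypothetical and stand in for the spider's request and parse steps.

```python
import re

YEAR_URL = "http://www.billboard.com/archive/charts/{year}/hot-100"
_YEAR_RE = re.compile(r"/archive/charts/(\d{4})/hot-100")


def next_year_url(current_url):
    """Return the archive URL for the following year, or None if no year is found."""
    match = _YEAR_RE.search(current_url)
    if match is None:
        return None
    return YEAR_URL.format(year=int(match.group(1)) + 1)


def crawl_years(first_year, fetch_rows):
    """Yield rows year by year, stopping at the first year with no rows.

    fetch_rows is a hypothetical callback standing in for the download and
    XPath extraction that Scrapy would perform.
    """
    url = YEAR_URL.format(year=first_year)
    while url is not None:
        rows = fetch_rows(url)
        if not rows:
            break  # table not found: stop crawling
        for row in rows:
            yield row
        url = next_year_url(url)
```

In a spider, the same idea would probably mean returning early from parse when rows is empty, and otherwise yielding a Request for next_year_url(response.url) after the items. That would replace the fixed list of start_urls with a single starting year.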

Hope this helps. :)
