How to scrape multiple pages from a website?

(Very) new to Python and to programming in general.

I have been trying to use Scrapy to scrape data from several pages/sections of the same website.

My code works, but it is unreadable and impractical:

import scrapy

class SomeSpider(scrapy.Spider):
    name = 'some'
    # allowed_domains takes bare domains, not URLs
    allowed_domains = ['example.com']
    start_urls = [
        'https://example.com/Python/?k=books&p=1',
        'https://example.com/Python/?k=books&p=2',
        'https://example.com/Python/?k=books&p=3',
        'https://example.com/Python/?k=tutorials&p=1',
        'https://example.com/Python/?k=tutorials&p=2',
        'https://example.com/Python/?k=tutorials&p=3',
    ]

    def parse(self, response):
        response.selector.remove_namespaces()

        info1 = response.css("scrapedinfo1").extract()
        info2 = response.css("scrapedinfo2").extract()

        for item in zip(info1, info2):
            scraped_info = {
                'scrapedinfo1': item[0],
                'scrapedinfo2': item[1],
            }

            yield scraped_info

How can I improve it?

I want to search across a certain number of categories and pages.

I need something like:

categories = ["books", "tutorials", "a", "b", "c", "d", "e", "f"]
pages = range(1, 3)

so that Scrapy can do its job across all categories and pages, while remaining easy to edit and adapt to other websites.

Any idea is welcome.

What I have tried:

import itertools

categories = ["books", "tutorials"]
base = "https://example.com/Python/?k={category}&p={index}"

def url_generator():
    for category, index in itertools.product(categories, range(1, 4)):
        yield base.format(category=category, index=index)

But Scrapy only returned:

[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
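
The crawl stayed at zero because the generator was never wired into the spider: Scrapy only consumes start_urls or start_requests(), and url_generator() was defined but never used. As a sketch (keeping the names defined above), the generator approach itself would have worked had its output been materialized into start_urls:

class SomeSpider(scrapy.Spider):
    name = 'some'
    allowed_domains = ['example.com']
    # consume the generator once, when the class is defined
    start_urls = list(url_generator())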

Solved, thanks to start_requests() and yield scrapy.Request().

Here is the code:

import scrapy
import itertools


class SomeSpider(scrapy.Spider):
    name = 'somespider'
    allowed_domains = ['example.com']

    def start_requests(self):
        categories = ["books", "tutorials"]
        base = "https://example.com/Python/?k={category}&p={index}"

        for category, index in itertools.product(categories, range(1, 4)):
            yield scrapy.Request(base.format(category=category, index=index))

    def parse(self, response):
        response.selector.remove_namespaces()

        info1 = response.css("scrapedinfo1").extract()
        info2 = response.css("scrapedinfo2").extract()

        for item in zip(info1, info2):
            scraped_info = {
                'scrapedinfo1': item[0],
                'scrapedinfo2': item[1],
            }

            yield scraped_info
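
Assuming the spider is saved inside a Scrapy project, it can then be run, with its items exported, using the standard CLI, for example:

scrapy crawl somespider -o output.json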

You can generate the urls at the start with the start_requests() method, using yield Request(url).

By the way: later, inside parse(), you can also use yield Request(url) to add new urls to the queue.
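
A minimal sketch of that follow-up pattern, assuming the quotes.toscrape.com markup used below, where the next-page link sits inside an li.next element:

    # inside the spider class:
    def parse(self, response):
        # ... extract items from the current page ...

        # queue the next page while parsing the current one
        next_href = response.css('li.next a::attr(href)').extract_first()
        if next_href:
            yield response.follow(next_href, callback=self.parse)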

For the test I used toscrape.com, a portal created for testing spiders:

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    allowed_domains = ['quotes.toscrape.com']  # domain only, without the scheme

    #start_urls = []

    tags = ['love', 'inspirational', 'life', 'humor', 'books', 'reading']
    pages = 3
    url_template = 'http://quotes.toscrape.com/tag/{}/page/{}'

    def start_requests(self):

        for tag in self.tags:
            for page in range(1, self.pages + 1):  # page numbers on the site start at 1
                url = self.url_template.format(tag, page)
                yield scrapy.Request(url)


    def parse(self, response):
        # test if method was executed
        print('url:', response.url)

# --- run it without project ---

from scrapy.crawler import CrawlerProcess

#c = CrawlerProcess({
#    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
#    'FEED_FORMAT': 'csv',
#    'FEED_URI': 'output.csv',
#})

c = CrawlerProcess()
c.crawl(MySpider)
c.start()
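
Restoring the commented settings dict would give the crawler a custom user agent and have Scrapy's feed exporter write the scraped items to output.csv (FEED_FORMAT / FEED_URI); with a bare CrawlerProcess() the default settings are used and the print() output simply appears in the console next to Scrapy's log.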
