
Can't scrape multiple pages using scrapy-playwright api

CONTEXT: I'm new to web scraping. I was trying to scrape a local e-commerce site. It's a dynamic website, so I am using scrapy-playwright (Chromium) with proxies.

PROBLEM: It ran smoothly until I tried to scrape multiple pages. I am using multiple URLs, each with its own page number, but instead of scraping different pages, it scrapes the first page multiple times. It seems that Playwright is at fault, but I am not sure whether the cause is wrong code or a bug. I have tried running it in different processes, but the results are the same. I tried with and without proxies and user agents, and I can't figure out why it's happening.

import logging
import scrapy
from scrapy_playwright.page import PageMethod
from helper import should_abort_request


class ABCSpider(scrapy.Spider):
    name = "ABC"
    custom_settings = {
        'PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT': '100000',
        'PLAYWRIGHT_ABORT_REQUEST': should_abort_request
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1',
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", '[class="box--LNmE6"]'),
                ],
            },
        )

    async def parse(self, response):

        total= response.xpath('/html/body/div[3]/div/div[2]/div/div/div[1]/div[3]/div/ul/li[last()-1]/a/text()').extract()[0]
        total_pages = int(total)   #total_pages = 4

        links = []

        for i in range(1, total_pages+1):
            a = 'https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page={}'.format(i)
            
            links.append(a)

        for link in links:
            res = scrapy.Request(url=link, meta={
                    "playwright": True,
                    "playwright_include_page": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector",
                                    '[class="box--ujueT"]'),
                    ]})

            yield res and {
                "link" : response.url 
            }

OUTPUT :

[
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"}
]

Instead of iterating over the pages in the start_requests method, you are trying to read the number of pages in the parse method and generate further requests from there.

The issue with this strategy is that each of the requests you generate in the parse method is itself handled by the parse method (no callback is given, so Scrapy falls back to parse). For every response, you are therefore telling it to generate a whole new set of requests, one for each page it detects from the page number, which is likely the same on every page.

Luckily, Scrapy has a built-in duplicate filter, so it would likely ignore these duplicates if you were yielding them properly.
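The fan-out described above can be illustrated with a toy model. This is not Scrapy's scheduler, only a sketch under the assumption that every response is handled by the same callback and that duplicates are dropped by URL. The names crawl and page_url and the example.com URL are hypothetical, made up for the illustration.

```python
from collections import deque


def page_url(n):
    # Hypothetical listing URL; only the page number varies.
    return f"https://example.com/list?page={n}"


def crawl(total_pages):
    """Toy crawl loop in which one callback handles every response.

    Returns (requests_generated, pages_fetched).
    """
    seen = {page_url(1)}           # URL-based duplicate filter
    queue = deque([page_url(1)])   # pending requests
    generated = 0
    fetched = []

    while queue:
        url = queue.popleft()
        fetched.append(url)
        # The single callback re-generates a request for *every* page,
        # no matter which page it is currently handling.
        for n in range(1, total_pages + 1):
            generated += 1
            candidate = page_url(n)
            if candidate not in seen:   # duplicates are dropped here
                seen.add(candidate)
                queue.append(candidate)

    return generated, fetched


generated, fetched = crawl(4)
print(generated)      # 16 requests generated (4 per response)
print(len(fetched))   # 4 pages actually fetched
```

With 4 pages, the callback generates 16 requests in total, and the duplicate filter is the only thing that keeps the crawl down to 4 fetches.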

The next issue is your yield statement. The expression a and b doesn't return both a and b; it returns b, unless a is falsy, in which case it returns a. A Request object is truthy, so the right-hand operand is always the result.

So your yield expression...

yield res and {
                "link" : response.url 
            }

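will only ever yield {"link": response.url}; the request stored in res is discarded. That is why the output shows page=1 four times: parse runs only on the first response, and none of the new requests are ever scheduled.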
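A small pure-Python sketch of this behaviour follows. FakeRequest is a hypothetical stand-in for scrapy.Request (any object without a custom truth value is truthy), and broken_parse and fixed_parse are illustrative names, not Scrapy API.

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request; plain objects are truthy."""

    def __init__(self, url):
        self.url = url


def broken_parse(current_url, links):
    # Mirrors the yield from the question: `and` evaluates to its right
    # operand when the left one is truthy, so the request is thrown away.
    for link in links:
        res = FakeRequest(link)
        yield res and {"link": current_url}


def fixed_parse(current_url, links):
    # Yield the item and the requests as separate values instead.
    yield {"link": current_url}
    for link in links:
        yield FakeRequest(link)


links = [f"https://example.com/list?page={n}" for n in range(2, 5)]
first = "https://example.com/list?page=1"

print([type(x).__name__ for x in broken_parse(first, links)])
# ['dict', 'dict', 'dict']
print([type(x).__name__ for x in fixed_parse(first, links)])
# ['dict', 'FakeRequest', 'FakeRequest', 'FakeRequest']
```

In a real spider, the follow-up requests would also need a callback of their own (or parse would have to stop generating requests on later pages) to avoid the repeated fan-out described earlier.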


Beyond what I mention above, your code doesn't do anything else. However, since you instruct the page to wait for the element containing the items for sale to render, I am assuming that your eventual goal is to scrape the data for each item on the page.

With this in mind, I would suggest that you don't use scrapy_playwright at all and instead get the data from the JSON API that the website uses in its AJAX requests.

For example:

import scrapy


class ABCSpider(scrapy.Spider):
    name = "ABC"

    def start_requests(self):
        # The question reports 4 pages, numbered from 1.
        for i in range(1, 5):
            url = f"https://www.daraz.com.bd/xbox-games/?ajax=true&page={i}&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO"
            yield scrapy.Request(url)

    def parse(self, response):
        # With ajax=true the site responds with JSON instead of HTML.
        data = response.json()
        items = data["mods"]["listItems"]
        for item in items:
            yield {"name": item['name'],
                   "brand": item['brandName'],
                   "price": item['price']}

partial output:

{'name': 'Xbox 360 GamePad, Xbox 360 Controller for Windows', 'brand': 'Microsoft', 'price': '1400.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Pole Bugatt RT 360 12FIT Fishing Rod Hat Chip', 'brand': 'No Brand', 'price': '1020.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Xbox 360 Controller,USB Wired Controller Gamepad for Microsoft Xbox 360,PC Windowns,XP,Vista,Win7 - Black', 'brand': 'Microsoft', 'price': '1250.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '【Seyijian】 1Set RB LB Bumpers Buttons for Microsoft XBox Series X Controller Button Holder RHA', 'brand': 'No Brand', 'price': '452.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'For Xbox One S Slim Internal Power Supply Adapter Replacement N115-120P1A 12V', 'brand': 'No Brand', 'price': '2591.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'DOU Lb Rb Lt Rt Front Bumper Buttons Set Replacement Accessory, Fits for X box Series S X Controllers', 'brand': 'No Brand', 'price': '602.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'IVYUEEN 2 Sets RB LB Bumpers Buttons for XBox Series X S Controller Trigger Button Middle Holder with Screwdriver Tool', 'brand': 'No Brand', 'price': '645.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Alloy Analog Controller Thumbsticks Replacement Parts Joysticks Analog Sticks for Xbox ONE / PS4 / Switch Controller 11 Pcs', 'brand': 'MOONEYE', 'price': '1544.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'FIFA 21 – Xbox One & Xbox Series X', 'brand': 'No Brand', 'price': '1800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Xbox 360 Controller,USB Wired Controller Gamepad for Microsoft Xbox 360,PC Windowns,XP,Vista,Win7 - Black', 'brand': 'No Brand', 'price': '1150.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Game Consoles Flight Stick Joystick USB Simulator Flight Controller Joystick', 'brand': 'No Brand', 'price': '15179.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Power Charger Adapter For Microsoft Surfa.6 RT  Charger US Plug', 'brand': 'No Brand', 'price': '964.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '684.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'FORIDE Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '763.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '663.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '739.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2208.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'TP4-005 Smart Turbo Temperature Control 5-Fan For Playstation 4 For PS4', 'brand': 'No Brand', 'price': '1239.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Stencils Bga Reballing Kit for Xbox Ps3 Chip Reballing Repair Game Consoles Repair Tools Kit', 'brand': 'No Brand', 'price': '1331.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Preloved Game Kinect xbox 360 CD Cassette Xbox360', 'brand': 'No Brand', 'price': '2138.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Repair', 'brand': 'No Brand', 'price': '734'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Shadow of the Tomb Raider - Xbox One', 'brand': 'No Brand', 'price': '2800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2322.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2027.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motheoard Repair', 'brand': 'No Brand', 'price': '649'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'XBOX 360 GAMES - DANCE CENTRAL 3 (KINECT REQUIRED) (FOR MOD /JAILBREAK CONSOLE)', 'brand': 'No Brand', 'price': '1485.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Kontrol Freek Call Of Duty Black Ops 4 Xbox One Series S-X', 'brand': 'No Brand', 'price': '810.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Hitman 2 - Xbox One', 'brand': 'No Brand', 'price': '2500.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Red Dead Redemption 2 XBOX ONE', 'brand': 'No Brand', 'price': '3800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Wired Gaming Headphones Bass Stereo Headsets with Mic for PS4 for XBOX-ONE', 'brand': 'No Brand', 'price': '977.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '10X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motheoard', 'brand': 'No Brand', 'price': '3615'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Repair', 'brand': 'No Brand', 'price': '739.00'}
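The parsing step in the spider above indexes the JSON directly, so a listing without a brand, or a response without listItems, would raise a KeyError. A more defensive sketch follows. The key names come from the spider above, while extract_items, the None defaults, and the sample payload are assumptions made for the illustration, not the real API response.

```python
import json


def extract_items(data):
    """Yield name/brand/price dicts from a decoded listing payload."""
    # Fall back to empty containers if the expected keys are missing.
    items = (data.get("mods") or {}).get("listItems") or []
    for item in items:
        yield {
            "name": item.get("name"),
            "brand": item.get("brandName"),
            "price": item.get("price"),
        }


# Made-up payload shaped like the fields used in the spider.
sample = json.loads(
    '{"mods": {"listItems": ['
    '{"name": "Hitman 2 - Xbox One", "brandName": "No Brand", "price": "2500.00"},'
    '{"name": "Item without brand", "price": "100.00"}'
    "]}}"
)

for row in extract_items(sample):
    print(row)

print(list(extract_items({})))  # [] when the payload has no listings
```

Inside the spider's parse method, this could be used as `yield from extract_items(response.json())`.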

