
Scrapy playwright only first page is scraped

I'm using scrapy together with scrapy_playwright (Python). When I crawl a site, the links are successfully extracted from the first page and requests for further pages are created, but then nothing happens with those pages and they never get scraped. The spider just closes. Does anyone know why?

Here is the code:

import os
from typing import Any, Dict, Iterator, List
from urllib.parse import urlparse

from scrapy import Request
from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_playwright.page import PageMethod


class ClientSideSiteSpider(CrawlSpider):
    name = "client-side-site"
    handle_httpstatus_list = [301, 302, 401, 403, 404, 408, 429, 500, 503]
    exclude_patterns: List[str] = []

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "ITEM_PIPELINES": {
            # more stuff...
        },
        "DOWNLOADER_MIDDLEWARES": {
            # more stuff...
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                "server": os.environ.get("PROXY_TR_SERVER"),
                "username": os.environ.get("PROXY_TR_USER"),
                "password": os.environ.get("PROXY_TR_PASSWORD"),
            },
        }
    }

    playwright_meta = {
        "playwright": True,
        "playwright_include_page": True,
        "playwright_page_methods": [
            PageMethod("wait_for_timeout", 10000),
        ],
    }

    def __init__(
        self,
        start_url: str,
        # here there is some more stuff...,
        **kwargs: Any
    ):
        self.start_urls: List[str] = [start_url]
        # boring initializations removed...

        url_parsed = urlparse(start_url)
        allow_path = url_parsed.path
        self.rules = (
            Rule(
                LinkExtractor(allow=allow_path),
                callback="parse_item",
                follow=True,
            ),
        )

        super().__init__(**kwargs)

    def start_requests(self) -> Iterator[Request]:
        for url in self.start_urls:
            yield Request(url, meta=self.playwright_meta)

    def parse_start_url(self, response: Response) -> Dict[str, Any]:
        return self.parse_item(response)

    def parse_item(self, response: Response) -> Dict[str, Any]:
        return {
            "status": response.status,
            "file_urls": [response.url],
            "body": response._get_body(),
            "type": response.headers.get("Content-Type", ""),
            "latency": response.meta.get("download_latency"),
        }

    def process_request(self, request: Request):
        """ adding playwright headers to all requests... necessary? """
        request.meta.update(self.playwright_meta)
        return request
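
A side note on that last method, since its own docstring asks whether it is necessary: as written, nothing ever calls it. CrawlSpider only invokes such a hook when the Rule names it in its process_request argument, so unless something else (e.g. one of the elided downloader middlewares) adds the playwright meta, the rule-extracted requests will not carry it. Below is a minimal sketch, not the original code, of wiring the hook into the rule; it assumes Scrapy >= 2.0, where rule hooks receive both the request and the originating response, and it omits all playwright settings for brevity:

from typing import Optional

from scrapy import Request
from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PlaywrightRuleSketch(CrawlSpider):
    # Hypothetical minimal spider illustrating the wiring only.
    name = "playwright-rule-sketch"
    start_urls = ["https://example.com/"]

    playwright_meta = {"playwright": True}

    rules = (
        Rule(
            LinkExtractor(),
            callback="parse_item",
            follow=True,
            # Without this argument CrawlSpider never calls the hook below.
            process_request="add_playwright_meta",
        ),
    )

    def add_playwright_meta(self, request: Request, response: Response) -> Optional[Request]:
        # Scrapy >= 2.0 calls rule hooks as (request, response); the
        # single-argument signature used in the question is deprecated.
        request.meta.update(self.playwright_meta)
        return request

    def parse_item(self, response: Response) -> dict:
        return {"url": response.url, "status": response.status}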

In the logs I can see that the first page is scraped successfully (and all of its links are followed), but the pages that follow are not.

First page:

2022-05-12 14:28:14 [scrapy-playwright] DEBUG: Browser context started: 'default'
2022-05-12 14:28:14 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2022-05-12 14:28:14 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/> (resource type: document, referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://discountcasino266.com/> (referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/runtime-es2015.3896a8c3776f78100458.js> (resource type: script, referrer: https://discountcasino266.com/)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/polyfills-es2015.77ed2742568a17467b11.js> (resource type: script, referrer: https://discountcasino266.com/)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/main-es2015.33ff46eac5dca0dd9807.js> (resource type: script, referrer: https://discountcasino266.com/)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/styles.d715a958203282df90b1.css> (resource type: stylesheet, referrer: https://discountcasino266.com/)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://discountcasino266.com/polyfills-es2015.77ed2742568a17467b11.js> (referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://discountcasino266.com/main-es2015.33ff46eac5dca0dd9807.js> (referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://discountcasino266.com/runtime-es2015.3896a8c3776f78100458.js> (referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/6051-es2015.0d363775a5eb43bd3a29.js> (resource type: script, referrer: https://discountcasino266.com/)
....

Following pages:

2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 2 (2 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 3 (3 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 4 (4 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 5 (5 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 6 (6 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 7 (7 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 8 (8 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 9 (9 for all contexts)
2022-05-12 14:28:18 [scrapy.core.engine] INFO: Closing spider (finished)
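
The way the page count above only ever climbs is also worth noting: with playwright_include_page set to True, scrapy-playwright hands the Page object to the callback in response.meta["playwright_page"] and leaves closing it to the spider, so pages that are never closed simply accumulate. A minimal sketch of a callback that releases the page (the callback must be async so it can await the close; again an illustration, not the original code):

from typing import Any, Dict

from scrapy.http import Response


async def parse_item(self, response: Response) -> Dict[str, Any]:
    # playwright_include_page=True means the spider now owns the page.
    page = response.meta["playwright_page"]
    try:
        return {
            "status": response.status,
            "file_urls": [response.url],
            "body": response.body,
        }
    finally:
        # Close the page so it does not stay open for the rest of the crawl.
        await page.close()

The scrapy-playwright README additionally recommends an errback that closes the page, so that failed requests do not leak pages either.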

I tried adding callback=self.parse_start_url in start_requests, as follows:

def start_requests(self) -> Iterator[Request]:
    for url in self.start_urls:
        yield Request(
            url, 
            callback=self.parse_start_url,
            meta=self.playwright_meta
        )
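
That change only affects the seed request, though, and says nothing about the requests the rule extracts afterwards. One quick, hypothetical diagnostic (not part of the original question) is to log at the top of parse_item whether each response actually went through Playwright:

from typing import Any, Dict

from scrapy.http import Response


def parse_item(self, response: Response) -> Dict[str, Any]:
    # Hypothetical drop-in for the spider above: if "playwright" is missing
    # for followed links, the meta never reached the rule-extracted requests.
    self.logger.info(
        "url=%s playwright=%s page_included=%s",
        response.url,
        response.meta.get("playwright"),
        "playwright_page" in response.meta,
    )
    return {"status": response.status, "file_urls": [response.url]}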

