
Requests fail with 504: Gateway Time-out when using scrapy-splash in docker compose with zyte

I'm trying to scrape a site that partially renders its content using JS.

I found this project: https://github.com/scrapinghub/sample-projects/tree/master/splash_smart_proxy_manager_example , which explains quite neatly how to set things up. Here's what I have right now:

Docker compose:

version: '3.8'

services:
    scraping:
        build:
            context: .
            dockerfile: Dockerfile
        volumes:
            - "./scraping:/scraping"
        environment:
            - PYTHONUNBUFFERED=1
        depends_on:
            - splash
        links:
            - splash
    splash:
        image: scrapinghub/splash
        restart: always
        expose:
            - 5023
            - 8050
            - 8051
        ports:
            - "5023:5023"
            - "8050:8050"
            - "8051:8051"

Spider:

from pkgutil import get_data

import scrapy
from scrapy_splash import SplashRequest


class HappySider(scrapy.Spider):
    ...
    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'ITEM_PIPELINES': {
            'scraping.pipelines.HappySpiderPipeline': 300,
        },
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 429, 403],
        'RETRY_TIMES': 20,
        'DOWNLOAD_DELAY': 5,
        'DOWNLOAD_TIMEOUT': 30,
        'CONCURRENT_REQUESTS': 1,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'COOKIES_ENABLED': False,
        'ROBOTSTXT_OBEY': True,
        # enable Zyte Proxy
        'ZYTE_SMARTPROXY_ENABLED': True,
        # the APIkey you get with your subscription
        'ZYTE_SMARTPROXY_APIKEY': '<my key>',
        'SPLASH_URL': 'http://splash:8050/',
    }

    def __init__(self, testing=False, name=None, **kwargs):
        self.LUA_SOURCE = get_data(
            'scraping', 'scripts/smart_proxy_manager.lua'
        ).decode('utf-8')
        super().__init__(name, **kwargs)

    def start_requests(self):

        yield SplashRequest(
            url='https://www.someawesomesi.te',
            endpoint='execute',
            args={
                'lua_source': self.LUA_SOURCE,
                'crawlera_user': self.settings['ZYTE_SMARTPROXY_APIKEY'],
                'timeout': 90,
            },
            # tell Splash to cache the lua script, to avoid sending it for every request
            cache_args=['lua_source'],
            meta={
                'max_retry_times': 10,
            },
            callback=self.my_callback
        )

And the output I get is:

2022-08-10 13:09:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.someawesomesi.te via http://splash:8050/execute> (failed 1 times): 504 Gateway Time-out

I'm not sure how to proceed here. I looked into why I would be getting a 504, and the Splash docs do suggest some ways of handling it, but I don't have many concurrent URLs and the script fails on the very first one. Also, the site I'm scraping is very fast, and if I use Zyte without Splash, it scrapes very quickly.

So if anybody can suggest what's wrong here and how to fix it, I'd greatly appreciate it.
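
One way to narrow this down is to ask Splash to render the page directly, without the Lua script or the proxy, and look at the status code it returns; Splash answers 504 when rendering exceeds its timeout. The following is a minimal sketch, not part of the original setup: the helper names are hypothetical, and it assumes Splash is reachable on port 8050 as published in the compose file above.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_render_url(splash_url, target_url, timeout=30, wait=0.5):
    """Build a URL for Splash's render.html endpoint."""
    query = urllib.parse.urlencode(
        {"url": target_url, "timeout": timeout, "wait": wait}
    )
    return urllib.parse.urljoin(splash_url, "render.html") + "?" + query


def render_status(splash_url, target_url, timeout=30):
    """Return the HTTP status code Splash responds with for target_url."""
    url = build_render_url(splash_url, target_url, timeout)
    try:
        # Give the HTTP client slightly longer than Splash's own timeout.
        with urllib.request.urlopen(url, timeout=timeout + 5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Splash reports its own failures (for example 504) as HTTP errors.
        return err.code
```

If `render_status("http://localhost:8050/", "https://www.someawesomesi.te")` succeeds while the `execute` request with the Lua script still returns 504, the proxy leg is the more likely culprit than Splash itself.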

This example did not work out of the box for me either. Changing the Zyte Smart Proxy Manager port number specified in splash_smart_proxy_manager_example/scripts/smart_proxy_manager.lua to 8010 helped.

local port = 8010

8010 was used in the older example.
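
To see which proxy port is reachable from your network before editing the Lua script, a plain TCP check can help. This is only a sketch: `can_connect` is a hypothetical helper, and it only verifies that a TCP connection opens, not that the proxy accepts your API key.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures and timeouts.
        return False
```

Call it with the proxy host from your copy of smart_proxy_manager.lua and each candidate port, for example `can_connect(host, 8010)`. Running it from a container on the same Docker network as Splash gives the most representative result, since Splash is the process that actually connects to the proxy.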

Splash is getting deprecated soon. You can use headless browser libraries for rendering JS along with Smart Proxy Manager. Zyte recently launched three headless browser libraries:

  1. Zyte SmartProxy Puppeteer.
  2. Zyte SmartProxy Playwright.
  3. Zyte SmartProxy Selenium.

These client libraries are built on top of their native libraries for web automation across Chromium, Firefox, and WebKit, and are written to work seamlessly with Zyte Smart Proxy Manager. Using these libraries, you no longer have to maintain a separate piece of software (like Splash) running in the background to connect with Zyte Smart Proxy Manager.

My recommendation would be to use Zyte API. Zyte API is an end-to-end API solution that executes all tasks in the web-scraping sequence. It can extract dynamically loaded web page content without you spending time recreating what the browser does through JavaScript, headless browser libraries and additional requests. Just set the javascript parameter to turn JavaScript ON or OFF during browser rendering. And it just works...
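
As a rough illustration of what such a request could look like, here is a sketch using only the standard library. The endpoint URL, the `browserHtml` field and the basic-auth scheme (API key as username, empty password) are assumptions based on the public Zyte API documentation rather than anything stated above, and the function names are hypothetical; check the current docs before relying on them.

```python
import base64
import json
import urllib.request

# Assumed endpoint; verify against the Zyte API documentation.
ZYTE_API_URL = "https://api.zyte.com/v1/extract"


def build_extract_request(api_key, url, javascript=True):
    """Build (but do not send) a POST request for browser-rendered HTML."""
    payload = {"url": url, "browserHtml": True, "javascript": javascript}
    # Assumed auth scheme: API key as basic-auth username, empty password.
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    return urllib.request.Request(
        ZYTE_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_browser_html(api_key, url, javascript=True):
    """Send the request and return the rendered HTML from the JSON reply."""
    request = build_extract_request(api_key, url, javascript)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["browserHtml"]
```

Splitting request construction from sending keeps the payload easy to inspect, which is handy when toggling `javascript` to compare results.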

I work as a Developer Advocate @zyte.
