Scrapy-Splash: waiting for a page to load

Question

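I'm new to Scrapy and Splash, and I need to scrape data from single-page and regular web apps.

One caveat: I'm mostly scraping internal tools and applications, so some of them require authentication, and all of them need at least a couple of seconds before the page is fully loaded.

I naively tried a Python time.sleep(seconds), and it didn't work. It basically looks like SplashRequest and scrapy.Request both run and yield results right away. I then learned that Lua scripts can be passed as arguments to these requests, and tried a Lua script with various forms of wait(), but it looks like the requests never actually run the Lua script. The request finishes immediately and my HTML selectors don't find what I'm looking for.

I'm following the directions at https://github.com/scrapy-plugins/scrapy-splash, have their Docker instance running on localhost:8050, and created a settings.py.

Does anyone with experience here know what I might be missing?

Thanks!

Spider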

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest
import logging
import base64
import time
# from selenium import webdriver

# lua_script="""
# function main(splash)
#     splash:set_user_agent(splash.args.ua)
#     assert(splash:go(splash.args.url))
#     splash:wait(5)

#     -- requires Splash 2.3  
#     -- while not splash:select('#user-form') do
#     -- splash:wait(5)
#     -- end
#     repeat
#         splash:wait(5))
#     until( splash:select('#user-form') ~= nil )

#     return {html=splash:html()}
# end
# """

load_page_script="""
    function main(splash)
        splash:set_user_agent(splash.args.ua)
        assert(splash:go(splash.args.url))
        splash:wait(5)

        function wait_for(splash, condition)
            while not condition() do
                splash:wait(0.5)
            end
        end

        local result, error = splash:wait_for_resume([[
            function main(splash) {
                setTimeout(function () {
                    splash.resume();
                }, 5000);
            }
        ]])

        wait_for(splash, function()
            return splash:evaljs("document.querySelector('#user-form') != null")
        end)

        -- repeat
        -- splash:wait(5))
        -- until( splash:select('#user-form') ~= nil )

        return {html=splash:html()}
    end
"""

class HelpSpider(scrapy.Spider):
    name = "help"
    allowed_domains = ["secet_internal_url.com"]
    start_urls = ['https://secet_internal_url.com']

    # http_user = 'splash-user'
    # http_pass = 'splash-password'


    def start_requests(self):
        logger = logging.getLogger()
        login_page = 'https://secet_internal_url.com/#/auth'

        splash_args = {
            'html': 1,
            'png': 1,
            'width': 600,
            'render_all': 1,
            'lua_source': load_page_script
        }

        #splash_args = {
        #    'html': 1,
        #    'png': 1,
        #   'width': 600,
        #   'render_all': 1,
        #    'lua_source': lua_script
        #}

        yield SplashRequest(login_page, self.parse, endpoint='execute', magic_response=True, args=splash_args)

    def parse(self, response):
        # time.sleep(10)
        logger = logging.getLogger()

        html = response._body.decode("utf-8") 

        # Looking for a form with the ID 'user-form'
        form = response.css('#user-form')

        logger.info("####################")
        logger.info(form)
        logger.info("####################")

Answer 1

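I figured it out!

Short Answer

My Spider class was configured incorrectly for using Splash with Scrapy.

Long Answer

In my case, part of running Splash with Scrapy is running a local Docker instance, which my requests are loaded into so that it can run the Lua scripts. An important caveat is that the Splash settings described on the GitHub page must be a property of the spider class itself, so I added this code to my Spider: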

custom_settings = {
    'SPLASH_URL': 'http://localhost:8050',
    # if installed Docker Toolbox: 
    #  'SPLASH_URL': 'http://192.168.99.100:8050',
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    },
    'SPIDER_MIDDLEWARES': {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    },
    'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
}

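After that I could see my Lua code running, and the Docker container logs showed the interactions. Once I fixed the errors around splash:select(), my login script worked, and so did my waits: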

splash:wait( seconds_to_wait )
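
The waiting idea above could be sketched roughly as follows. This is not the script from this answer, only a minimal illustration: a Lua script that polls for an element with an upper bound so it cannot loop forever, plus the args dict that could be handed to SplashRequest with endpoint='execute'. The names WAIT_FOR_ELEMENT_LUA, build_splash_args, selector and max_wait are hypothetical, and splash:select assumes a Splash version that provides it.

```python
# Lua script meant for Splash's `execute` endpoint. It polls for an element
# instead of sleeping for a fixed time, and gives up after `max_wait` seconds.
# `selector` and `max_wait` are hypothetical argument names chosen for this sketch.
WAIT_FOR_ELEMENT_LUA = """
function main(splash)
    assert(splash:go(splash.args.url))

    local waited = 0
    while not splash:select(splash.args.selector) do
        if waited >= splash.args.max_wait then
            error("timed out waiting for " .. splash.args.selector)
        end
        splash:wait(0.5)
        waited = waited + 0.5
    end

    return {html = splash:html()}
end
"""


def build_splash_args(selector, max_wait=10.0):
    """Build the `args` dict for a SplashRequest (hypothetical helper)."""
    if max_wait <= 0:
        raise ValueError("max_wait must be positive")
    return {
        "lua_source": WAIT_FOR_ELEMENT_LUA,
        "selector": selector,
        "max_wait": max_wait,
    }


args = build_splash_args("#user-form", max_wait=15)
print(sorted(args))
```

In a spider this might be used as SplashRequest(url, self.parse, endpoint='execute', args=build_splash_args('#user-form')). The value of max_wait should stay below Splash's own request timeout, otherwise Splash aborts the script before the loop gives up.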

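Lastly, I created a Lua script to handle logging in, redirecting, and gathering links and text from the pages. My application is an AngularJS app, so I can't gather links or visit them except by clicking. The script let me run through every link, click it, and collect the content.

I suppose an alternative would have been an end-to-end testing tool such as Selenium/WebDriver or Cypress, but I prefer to use Scrapy for scraping and testing tools for testing. To each their own (Python or NodeJS tools), I suppose.

Neat Trick

Another thing worth mentioning that really helps with debugging: when you're running the Docker instance for Scrapy-Splash, you can visit its URL in your browser, where an interactive "request tester" lets you try out Lua scripts and see the rendered HTML results (for example, to verify a login or a page visit). For me this URL was http://0.0.0.0:8050. It is the URL set in your settings, and it should be configured to match your Docker container.

Cheers!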
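
The same kind of check can probably be scripted as well, since Splash accepts Lua scripts over HTTP at its /execute endpoint. The sketch below only builds the request with the standard library and does not send it; build_execute_request, the example URL and the wait value are assumptions made for illustration, not something taken from this answer.

```python
import json
import urllib.request


def build_execute_request(splash_url, lua_source, **script_args):
    """Build (but do not send) a POST request for Splash's /execute endpoint."""
    # Extra keyword arguments become available as splash.args.<name> in Lua.
    payload = dict(script_args, lua_source=lua_source)
    return urllib.request.Request(
        url=splash_url.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


LUA = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)
    return {html = splash:html()}
end
"""

request = build_execute_request(
    "http://localhost:8050", LUA, url="https://example.com", wait=2
)
print(request.get_method(), request.full_url)

# Sending it needs a running Splash instance, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["html"][:200])
```

If the script fails, Splash answers with an error response that describes the Lua problem, which is often quicker to read than digging through Scrapy's logs.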

Answer 1: accepted, 2019-09-04 23:01:28
