
Scrapy - Splash fetch dynamic data

I am trying to fetch a dynamically loaded phone number from this page (among others): https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html

The phone number appears after clicking the div element with the class page-action click-tel. I am trying to get at this data with scrapy-splash, using a Lua script to execute the click.

After pulling the Splash image on my Ubuntu machine:

sudo docker run -d -p 8050:8050 scrapinghub/splash

Here is my code so far (I am using a proxy service):

import scrapy
from bs4 import BeautifulSoup
# company_item is defined elsewhere in the project (items module)

class company(scrapy.Spider):
    name = "company"
    custom_settings = {
        "FEEDS" : {
            '/home/ubuntu/scraping/europages/data/company.json': {
                'format': 'jsonlines',
                'encoding': 'utf8'
            }
        },
        "DOWNLOADER_MIDDLEWARES" : { 
            'scrapy_splash.SplashCookiesMiddleware': 723, 
            'scrapy_splash.SplashMiddleware': 725, 
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810, 
        },
        "SPLASH_URL" : 'http://127.0.0.1:8050/',
        "SPIDER_MIDDLEWARES" : { 
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100, 
        },
        "DUPEFILTER_CLASS" : 'scrapy_splash.SplashAwareDupeFilter',
        "HTTPCACHE_STORAGE" : 'scrapy_splash.SplashAwareFSCacheStorage'

    }
    allowed_domains = ['www.europages.fr']

    def __init__(self, company_url, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.company_url = "https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html" ##forced
        self.item = company_item()
        self.script = """
            function main(splash)
                splash.private_mode_enabled = false
                assert(splash:go(splash.args.url))
                assert(splash:wait(0.5))
                local element = splash:select('.page-action.click-tel') 
                local bounds = element:bounds()
                element:mouse_click{x=bounds.width/2, y=bounds.height/2}
                splash:wait(4)
                return splash:html()
            end
        """
            
    def start_requests(self):
        yield scrapy.Request(
            url = self.company_url,
            callback = self.parse,
            dont_filter = True,
            meta = {
                    'splash': {
                        'endpoint': 'execute',
                        'url': self.company_url,
                        'args': {
                            'lua_source': self.script,
                            'proxy': 'http://usernamepassword@proxyhost:port',
                            'html':1,
                            'iframes':1

                        }
                    }   
            }
        )
    def parse(self, response):
        soup = BeautifulSoup(response.body, "lxml")
        # attrs must be a dict ({'class': ...}), not a set ({'class', ...})
        print(soup.find('div', {'class': 'page-action click-tel'}))

The problem is that it has no effect: I still get nothing, as if the button had never been clicked.

Shouldn't return splash:html() return the result of element:mouse_click{x=bounds.width/2, y=bounds.height/2} (since element:mouse_click() waits for the changes to appear) in response.body?

Am I missing something here?
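One thing worth ruling out first: per the Splash scripting documentation, splash:select returns nil when the selector matches nothing, and calling bounds() on nil aborts the script with an error. A debugging variant of the script above (a sketch, not a fix) returns a marker string instead, so the HTML response tells you whether the selector matched at all:

```python
# Sketch of a debug version of the Lua script: if the selector does not
# match, return a marker string instead of crashing on a nil element.
debug_script = """
    function main(splash)
        splash.private_mode_enabled = false
        assert(splash:go(splash.args.url))
        assert(splash:wait(0.5))
        local element = splash:select('.page-action.click-tel')
        if element == nil then
            -- selector matched nothing; report it instead of erroring out
            return 'SELECTOR NOT FOUND'
        end
        local bounds = element:bounds()
        element:mouse_click{x=bounds.width/2, y=bounds.height/2}
        splash:wait(4)
        return splash:html()
    end
"""
```

If the spider then receives 'SELECTOR NOT FOUND' as the body, the problem is the selector or page load timing, not the click itself.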

Most of the time, when sites load data dynamically, they do so via background XHR requests to the server. A close look at the network tab when you click the 'telephone' button shows that the browser sends an XHR request to the URL https://www.europages.fr/InfosTelecomJson.json?uidsid=DEU241700-00101&id=1330. You can emulate that request in your spider and avoid scrapy-splash altogether. See the sample implementation below, using one URL:

import scrapy
from urllib.parse import urlparse

class Company(scrapy.Spider):
    name = 'company'
    allowed_domains = ['www.europages.fr']
    start_urls = ['https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html']

    def parse(self, response):
        # obtain the id and uuid to make xhr request
        # note: str.rstrip strips a *set* of characters, so removesuffix is safer here
        uuid = urlparse(response.url).path.split('/')[-1].removesuffix('.html')
        id = response.xpath("//div[@itemprop='telephone']/a/@onclick").re_first(r"event,'(\d+)',")
        yield scrapy.Request(f"https://www.europages.fr/InfosTelecomJson.json?uidsid={uuid}&id={id}", callback=self.parse_address)

    def parse_address(self, response):
        yield response.json()

I get the response:

{'digits': '+49 220 69 53 30'}
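The uuid/id extraction in parse above can also be exercised in isolation. A minimal sketch, where the onclick value is a hypothetical stand-in for whatever attribute the page actually emits:

```python
import re
from urllib.parse import urlparse

def extract_uuid(url):
    # last path segment, with the literal '.html' suffix removed
    name = urlparse(url).path.split('/')[-1]
    return name[:-5] if name.endswith('.html') else name

def extract_id(onclick):
    # the numeric id sits between "event,'" and "'," in the onclick attribute
    m = re.search(r"event,'(\d+)',", onclick)
    return m.group(1) if m else None

url = "https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html"
onclick = "getPhoneNumber(event,'1330','tel')"  # hypothetical attribute value
print(extract_uuid(url))    # DEU241700-00101
print(extract_id(onclick))  # 1330
```

Checking these two pieces separately makes it easy to tell whether a missing result comes from the URL parsing, the onclick regex, or the JSON endpoint itself.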
