
Unable to scrape a website that uses JavaScript styled-components

My goal

Get basic information from this page using the Scrapy framework, although the question is not specific to this framework. Let's take the p element inside the h1 node as an example.

Issue

All the selections I make against the response I get from my Scrapy requests fail to return what's inside the h1 node.

scrapy shell 'url'
response
>>> 200
response.xpath('//h1/p')
>>> []
Fetching the response:

When fetching the response, I see a structure I can't really understand: all the main HTML markup is condensed and placed just after a bunch of JavaScript styled-components. The file is here (line 1725).
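To check whether the h1 markup is really present in the raw body (rather than being added later by JavaScript), one option is to search the response text for the opening tag and report the lines it appears on. The following is a minimal sketch using only the standard library; the helper name find_tag_lines and the sample string are hypothetical, and in the Scrapy shell the input would be response.text.

```python
import re


def find_tag_lines(html_text: str, tag: str) -> list[int]:
    """Return the 1-based line numbers on which an opening <tag ...> appears."""
    # Require whitespace or '>' after the name so that 'h1' does not match 'h10'.
    pattern = re.compile(rf"<{re.escape(tag)}[\s>]", re.IGNORECASE)
    return [
        lineno
        for lineno, line in enumerate(html_text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Hypothetical stand-in for response.text: styles first, condensed markup after.
sample = (
    "<html>\n"
    "<style>.a{color:red}</style>\n"
    '<body><h1 class="x"><p>Title</p></h1></body>\n'
    "</html>"
)

print(find_tag_lines(sample, "h1"))
```

If the tag is found in the raw text but the selector still returns nothing, the problem is more likely in how the markup is parsed than in JavaScript rendering.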

My process

Testing the selector from the dev tools:

After disabling JavaScript in the dev tools and testing my selector, I get the desired result. For example, I get the <p> element inside the <h1> with a simple query //h1/p from the console.

Testing the selector with scrapy shell:

Not working, see Issue.

Testing the selector with Splash:

I get the exact same result as shown in the issue.

I can't explain the error, but I can hopefully provide an answer to your problem:

response.xpath('//*[@class="summary__StyledAddress-e4c4ok-6 zWwUF textIntent-title1"]/text()').get()

returns: '12-14 31st Avenue, Unit 2 '

Which is hopefully what you need?
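The class value in that selector (summary__StyledAddress-e4c4ok-6 zWwUF) looks generated by styled-components, so an exact match on the full string may stop working if the site is rebuilt. A more tolerant approach is to match on a substring of the class instead. The sketch below illustrates the idea with the standard library only; the ClassSubstringText class and the sample markup are hypothetical, reconstructed from the class name and text shown above.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassSubstringText(HTMLParser):
    """Collect the direct text of elements whose class attribute contains a substring."""

    def __init__(self, needle: str):
        super().__init__()
        self.needle = needle
        self.depth = 0   # 0 outside a match, 1 directly inside, >1 in nested tags
        self.texts = []  # one entry per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.needle in (dict(attrs).get("class") or ""):
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 1:  # keep only text directly inside the matching element
            self.texts[-1] += data


# Hypothetical markup based on the class and address quoted in the answer.
sample = (
    '<h1><p class="summary__StyledAddress-e4c4ok-6 zWwUF textIntent-title1">'
    "12-14 31st Avenue, Unit 2 </p></h1>"
)

parser = ClassSubstringText("StyledAddress")
parser.feed(sample)
print(parser.texts)
```

In Scrapy itself, the same idea can be written with the XPath contains() function, for example response.xpath('//*[contains(@class, "StyledAddress")]/text()').get(). This assumes the "StyledAddress" part of the name stays stable.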

Dr P.
