
Difficulty in web-scraping data using scrapy

I am trying to scrape data with scrapy from https://www.ta.com/portfolio/business-services, but I get nothing back. I want to extract the href values inside div.tiles.js-portfolio-tiles using response.css("div.tiles.js-portfolio-tiles a::attr(href)").extract(). I think this has something to do with the ::before that appears just before this element, but maybe not. How do I go about extracting these links?

The elements you are interested in are loaded by your browser using JavaScript. By default, scrapy cannot load elements using JavaScript because it is not a browser; it simply retrieves the raw HTML.

Scrapy shell is an invaluable tool for inspecting what is available in the response that scrapy receives.

This set of commands will open the response in your default web browser:

$ scrapy shell
>>> fetch("https://www.ta.com/portfolio/business-services")
>>> view(response)

As you can see, the js-portfolio-tiles elements are not visible because they have not been loaded.
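The difference between the raw response and the rendered page can also be checked without a browser. The following is a minimal sketch using only Python's standard library, not scrapy's own selector engine. The two HTML strings are made-up stand-ins (the real page markup is an assumption here): one imitates the raw HTML with an empty container, the other imitates the DOM after JavaScript has filled it in. `TileLinkParser` and `tile_links` are hypothetical helper names.

```python
from html.parser import HTMLParser


class TileLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside div.tiles.js-portfolio-tiles."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # <div> nesting depth inside the target container (0 = outside)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = set((attrs.get("class") or "").split())
            # Enter the container, or go one level deeper if already inside it.
            if self.depth or {"tiles", "js-portfolio-tiles"} <= classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def tile_links(html):
    """Return every href found inside the portfolio tiles container."""
    parser = TileLinkParser()
    parser.feed(html)
    return parser.hrefs


# Made-up stand-ins, NOT the real markup of the site in the question.
RAW_HTML = '<div class="tiles js-portfolio-tiles"></div>'
RENDERED_HTML = (
    '<div class="tiles js-portfolio-tiles">'
    '<div class="tile"><a href="/portfolio/example-co">Example Co</a></div>'
    "</div>"
)

print(tile_links(RAW_HTML))       # []
print(tile_links(RENDERED_HTML))  # ['/portfolio/example-co']
```

An empty list for the raw HTML means the selector itself is fine and the links are simply not in the document scrapy downloaded.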

I had a look at the AJAX requests in the network panel of the developer tools, and it appears that the information you require may be available in an XHR request. If it is not, you will need additional software to execute the JavaScript, namely scrapy-splash or Selenium. I would advise exploring the AJAX (XHR) request first, though, as this will be much faster and easier.
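If the XHR request turns out to return JSON, pulling the links out of it is usually a few lines. The sketch below is only an illustration: the payload shape (an "items" list whose entries carry a "url" key) and the helper name `hrefs_from_xhr` are assumptions, since the real endpoint and its format have to be read off the network panel. The endpoint might equally return an HTML fragment, in which case the usual CSS selectors would apply to that fragment instead.

```python
import json


def hrefs_from_xhr(body):
    """Extract link URLs from a JSON body.

    The "items" and "url" keys are assumed for illustration;
    replace them with whatever the real XHR response contains.
    """
    data = json.loads(body)
    return [item["url"] for item in data.get("items", []) if "url" in item]


# Made-up sample body standing in for the real XHR response.
SAMPLE_BODY = json.dumps(
    {
        "items": [
            {"name": "Example Co", "url": "/portfolio/example-co"},
            {"name": "Entry without a link"},
        ]
    }
)

print(hrefs_from_xhr(SAMPLE_BODY))  # ['/portfolio/example-co']
```

Inside a scrapy callback, a function like this could be given response.text once the spider requests the XHR URL directly instead of the page URL.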

See this question for additional details on using your browser's dev tools to inspect AJAX requests.

