
How do I obtain data with Scrapy from a web page when the content I want to scrape does not appear in the page source?

I'm trying to get the user names and the comment text that appear on this page:

The user names and text that I need to extract: (screenshot)

When I test the extraction with the Chrome plugin XPath Helper, I get the user names with the expression:

//*[@id="livefyre"]/div/div/div/div/article/div/header/a/span

and I get the comments with:

//*[@id="livefyre"]/div/div/div/div/article/div/section/div/p

When I run the same test in the Scrapy shell, with the query:

response.xpath('//*[@id="livefyre"]/div/div/div/div/article/div/section/div/p').extract()

I get an empty list, [].

I've also tried:

response.xpath('//*[@id="livefyre"]/div/div/div/div/article/div/section/div/p/text()').extract()

The same thing happens in my spider code.

Checking the source of the page, I see that none of those comments exist in the HTML.

When I inspect the page, for example, I see the comment text: (screenshot)

But when I view the HTML source of the page, I do not see it anywhere: (screenshot)

Where am I making a mistake?

Thanks for the help.

As you stated, the comments are not in the HTML source of the page, which means they are rendered in the browser by JavaScript. Scrapy downloads only the raw HTML and does not execute JavaScript, so an XPath that matches in the browser returns [] in Scrapy. There are two ways to scrape this kind of website.

First, use scrapy-splash to render the JavaScript.

Second, find the API/network call that loads the comments (for example, in the Network tab of the browser's developer tools) and reproduce that request in Scrapy to get your data.
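
A minimal sketch of the second approach, assuming the comments endpoint returns JSON. The payload shape below (a "comments" list with "author", "name" and "body" keys) and the helper name parse_comments are hypothetical, made up for illustration; the real response for this site has to be checked in the Network tab and will almost certainly look different.

```python
import json

# Hypothetical payload: the real comments endpoint will have its own shape.
# Inspect the actual response in the browser's Network tab before adapting this.
SAMPLE_PAYLOAD = """
{"comments": [
  {"author": {"name": "alice"}, "body": "First comment"},
  {"author": {"name": "bob"}, "body": "Second comment"}
]}
"""


def parse_comments(payload_text):
    """Return a list of {'user', 'text'} dicts from a JSON comments payload."""
    data = json.loads(payload_text)
    items = []
    # Missing keys yield None (or an empty list) instead of raising KeyError.
    for comment in data.get("comments", []):
        items.append({
            "user": comment.get("author", {}).get("name"),
            "text": comment.get("body"),
        })
    return items


comments = parse_comments(SAMPLE_PAYLOAD)
print(comments)
```

In a spider, the idea would be to yield a scrapy.Request for the endpoint URL found in the Network tab, with a callback that passes response.text to a helper like this one and yields the resulting dicts as items.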

