What's the best way to scrape the Disqus comment count in Scrapy?

I'm just getting started with Scrapy and am interested in the best practice for this situation. Scrapy is designed to select elements on a page using either CSS or XPath. Disqus comments appear to load in an iframe, which makes them harder to scrape. I know Disqus has an API, but is there a way to scrape the comments using XPath/CSS or some other simple selector?

Here's an example post: http://www.ibtimes.com/who-aaron-ybarra-suspected-seattle-pacific-university-shooter-obsessed-columbine-1595326

I tried just using the XPath of the Disqus comment count, but that didn't appear to work.

In [36]: sel.xpath('//*[@id="main-nav"]/nav/ul/li[1]/a/span[1]').extract()
Out[36]: []

Is there some other way to get the count? What is the best strategy here?

Disqus is embedded in an iframe on third-party websites. By reading the iframe's "src" attribute, you can follow that link and then proceed as normal.

You would need to use a headless browser. Try a package such as scrapy-selenium.

