How can Scrapy deal with JavaScript?
Spider for reference:
import scrapy
from scrapy.spiders import Spider
from scrapy.selector import Selector
from script.items import ScriptItem

class RunSpider(scrapy.Spider):
    name = "run"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

    def parse(self, response):
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            #print widget.extract()
            item = ScriptItem()
            item['url'] = widget.xpath('.//a/@href').extract()
            url = item['url']
            #print url
            yield item
When I run this, the output in the terminal is as follows:
2015-08-21 14:23:51 [scrapy] DEBUG: Scraped from <200 http://www.stopitrightnow.com/>
{'url': []}
<div class="shopthepost-widget" data-widget-id="708473">
<script type="text/javascript">!function(d,s,id){var e, p = /^http:/.test(d.location) ? 'http' : 'https';if(!d.getElementById(id)) {e = d.createElement(s);e.id = id;e.src = p + '://' + 'widgets.rewardstyle.com' + '/js/shopthepost.js';d.body.appendChild(e);}if(typeof window.__stp === 'object') if(d.readyState === 'complete') {window.__stp.init();}}(document, 'script', 'shopthepost-script');</script><br>
This is the HTML:
<div class="shopthepost-widget" data-widget-id="708473" data-widget-uid="1"><div id="stp-55d44feabd0eb" class="stp-outer stp-no-controls">
<a class="stp-control stp-left stp-hidden"><</a>
<div class="stp-inner" style="width: auto">
<div class="stp-slide" style="left: -0%">
<a href="http://rstyle.me/iA-n/zzhv34c_" target="_blank" rel="nofollow" class="stp-product " data-index="0" style="margin: 0 0px 0 0px">
<span class="stp-help"></span>
<img src="//images.rewardstyle.com/img?v=2.13&p=n_24878713">
</a>
<a href="http://rstyle.me/iA-n/zzhvw4c_" target="_blank" rel="nofollow" class="stp-product " data-index="1" style="margin: 0 0px 0 0px">
<span class="stp-help"></span>
<img src="//images.rewardstyle.com/img?v=2.13&p=n_24878708">
To me it seems to hit a block when trying to activate the JavaScript. I am aware that JavaScript cannot run in Scrapy, but there must be a way of getting to those links. I have looked at Selenium but cannot get a handle on it.
Any and all help welcome.
I've solved it with ScrapyJS.
Follow the setup instructions in the official documentation and this answer.
Here is the test spider I've used:
# -*- coding: utf-8 -*-
import scrapy

class TestSpider(scrapy.Spider):
    name = "run"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

    def start_requests(self):
        # Route each start URL through Splash so the page's JavaScript runs
        for url in self.start_urls:
            yield scrapy.Request(url, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            print(widget.xpath('.//a/@href').extract())
And here is what I've got on the console:
[u'http://rstyle.me/iA-n/7bk8r4c_', u'http://rstyle.me/iA-n/7bk754c_', u'http://rstyle.me/iA-n/6th5d4c_', u'http://rstyle.me/iA-n/7bm3s4c_', u'http://rstyle.me/iA-n/2xeat4c_', u'http://rstyle.me/iA-n/7bi7f4c_', u'http://rstyle.me/iA-n/66abw4c_', u'http://rstyle.me/iA-n/7bm4j4c_']
[u'http://rstyle.me/iA-n/zzhv34c_', u'http://rstyle.me/iA-n/zzhvw4c_', u'http://rstyle.me/iA-n/zwuvk4c_', u'http://rstyle.me/iA-n/zzhvr4c_', u'http://rstyle.me/iA-n/zzh9g4c_', u'http://rstyle.me/iA-n/zzhz54c_', u'http://rstyle.me/iA-n/zwuuy4c_', u'http://rstyle.me/iA-n/zzhx94c_']
A non-JavaScript alternative to alecxe's answer is to manually inspect where the page loads its content from, and add that functionality yourself (see this SO question for more details).
In this case, we get the following:
So, for <div class="shopthepost-widget" data-widget-id="708473">, JavaScript is executed to embed the URL "widgets.rewardstyle.com/stps/708473.html".
You could handle this by manually generating a request for these URLs yourself:
def parse(self, response):
    for widget in response.xpath('//div[@class="shopthepost-widget"]'):
        widget_id = widget.xpath('@data-widget-id').extract()[0]
        widget_url = "http://widgets.rewardstyle.com/stps/{id}.html".format(id=widget_id)
        # scrapy.Request assumes "import scrapy" at the top of the spider module
        yield scrapy.Request(widget_url, callback=self.parse_widget)

def parse_widget(self, response):
    for link in response.xpath('//a[contains(@class, "stp-product")]'):
        item = JavasItem()  # Name provided by author, see comments below
        item['link'] = link.xpath("@href").extract()
        yield item
    # Do whatever else you want with the opened page.
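The URL construction can be checked offline before wiring it into the spider. The following is a minimal sketch that uses only the Python 3 standard library rather than Scrapy selectors, assuming the raw page contains the widget div as shown in the question. The names `WidgetIdParser` and `widget_urls` are made up for illustration, and the URL pattern is the one observed above, not a documented API.

```python
from html.parser import HTMLParser

# URL pattern observed for this site's widgets; an assumption, not a documented API.
WIDGET_URL = "http://widgets.rewardstyle.com/stps/{id}.html"


class WidgetIdParser(HTMLParser):
    """Collect data-widget-id values from shopthepost-widget divs."""

    def __init__(self):
        super().__init__()
        self.widget_ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "shopthepost-widget" in classes:
            widget_id = attrs.get("data-widget-id")
            if widget_id:
                self.widget_ids.append(widget_id)


def widget_urls(html):
    """Return the widget page URL for every widget div found in html."""
    parser = WidgetIdParser()
    parser.feed(html)
    parser.close()
    return [WIDGET_URL.format(id=widget_id) for widget_id in parser.widget_ids]


# Raw (unrendered) markup, shortened from the question.
raw_html = '<div class="shopthepost-widget" data-widget-id="708473"><script></script></div>'
print(widget_urls(raw_html))
```

Each URL it returns is what the spider would request in `parse()`; the product links themselves only appear in the response handled by `parse_widget()`.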
If you need to keep these widgets associated with whatever post/article they are a part of, pass that information into the request via meta.
EDIT: parse_widget() was updated. It uses contains for figuring out the class, as it has a space at the end. You could alternatively use a CSS selector, but it's really your call.
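To see why an exact comparison fails here: the anchors in the rendered widget carry class="stp-product " with a trailing space, so @class="stp-product" does not equal the attribute string, while a substring test or a token-based test still matches. A small sketch of the three comparisons, using plain Python string operations as a stand-in for the selector engine (it is not Scrapy's implementation):

```python
# The class attribute exactly as it appears in the rendered widget HTML.
class_attr = "stp-product "

# Counterpart of XPath @class="stp-product": whole-string comparison.
exact_match = class_attr == "stp-product"

# Counterpart of XPath contains(@class, "stp-product"): substring test.
substring_match = "stp-product" in class_attr

# Roughly what a CSS class selector such as a.stp-product does: compare
# whitespace-separated tokens, so surrounding spaces do not matter.
token_match = "stp-product" in class_attr.split()

print(exact_match, substring_match, token_match)  # False True True
```

The substring test would also accept a longer class name such as stp-product-large (a hypothetical name), whereas the token test would not, which is one reason to prefer a CSS selector like response.css('a.stp-product') when class names overlap.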