Can Scrapy be used to scrape dynamic content from websites that are using AJAX?
Scrape websites using Scrapy
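I'm trying to scrape a website using Scrapy, but I'm having trouble scraping all of its products because it uses endless scrolling...
I can only scrape data for 52 items, but there are 3824 items.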
hxs.select("//span[@class='itm-Catbrand strong']").extract()
hxs.select("//span[@class='itm-price ']").extract()
hxs.select("//span[@class='itm-title']").extract()
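If I use hxs.select("//div[@id='content']/div/div/div").extract() instead,
it extracts the whole list of items but does not filter any further... How do I scrape all the items?
I tried this, but the result is the same. Where am I going wrong?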
def parse(self, response):
    filename = response.url.split("/")[-2]
    open(filename, 'wb').write(response.body)
    for n in [2,3,4,5,6]:
        req = Request(url="http://www.jabong.com/men/shoes/?page=" + n,
                      headers = {"Referer": "http://www.jabong.com/men/shoes/",
                                 "X-Requested-With": response.header['X-Requested-With']})
    return req
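As you guessed, this website uses JavaScript to load more items as you scroll the page.
Using the developer tools included in my browser (Ctrl+Shift+I in Chromium), I saw in the Network tab that the JavaScript included in the page performs the following requests to load more items: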
GET http://www.website-your-are-crawling.com/men/shoes/?page=2 # 2,3,4,5,6 etc...
The web server responds with documents of the following kind:
<li id="PH969SH70HPTINDFAS" class="itm hasOverlay unit size1of4 ">
<div id="qa-quick-view-btn" class="quickviewZoom itm-quickview ui-buttonQuickview l-absolute pos-t" title="Quick View" data-url ="phosphorus-Black-Moccasins-233629.html" data-sku="PH969SH70HPTINDFAS" onClick="_gaq.push(['_trackEvent', 'BadgeQV','Shown','OFFER INSIDE']);">Quick view</div>
<div class="itm-qlInsert tooltip-qlist highlightStar"
onclick="javascript:Rocket.QuickList.insert('PH969SH70HPTINDFAS', 'catalog');
return false;" >
<div class="starHrMsg">
<span class="starHrMsgArrow"> </span>
Save for later </div>
</div>
<a id='cat_105_PH969SH70HPTINDFAS' class="itm-link sobrTxt" href="/phosphorus-Black-Moccasins-233629.html"
onclick="fireGaq('_trackEvent', 'Catalog to PDP', 'men--Shoes--Moccasins', 'PH969SH70HPTINDFAS--1699.00--', this),fireGaq('_trackEvent', 'BadgePDP','Shown','OFFER INSIDE', this);">
<span class="lazyImage">
<span style="width:176px;height:255px;" class="itm-imageWrapper itm-imageWrapper-PH969SH70HPTINDFAS" id="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" itm-img-width="176" itm-img-height="255" itm-img-sprites="4">
<noscript><img src="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" width="176" height="255" class="itm-img"></noscript>
</span>
</span>
<span class="itm-budgeFlag offInside"><span class="flagBrdLeft"></span>OFFER INSIDE</span>
<span class="itm-Catbrand strong">Phosphorus</span>
<span class="itm-title">
Black Moccasins </span>
These documents contain more items.
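To illustrate how the brand and title can be pulled out of such a fragment one item at a time, here is a minimal sketch. It uses the standard library's html.parser as a stand-in for Scrapy's XPath selectors so that it runs without Scrapy installed. The class name ItemFieldParser and the shortened sample fragment are made up for this example, and it assumes the target spans contain no nested spans, which holds for the response shown above.

```python
from html.parser import HTMLParser


class ItemFieldParser(HTMLParser):
    """Hypothetical helper: collects brand and title for each <li class="itm ...">."""

    # Map a CSS class seen in the response to the field name we store.
    WANTED = {"itm-Catbrand": "brand", "itm-title": "title"}

    def __init__(self):
        super().__init__()
        self.items = []      # one dict per product
        self._field = None   # field currently being read, if any
        self._chunks = []    # text pieces of the current field

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "itm" in classes:
            self.items.append({})  # a new product starts here
        elif tag == "span" and self.items:
            for name, field in self.WANTED.items():
                if name in classes:
                    self._field, self._chunks = field, []

    def handle_data(self, data):
        if self._field:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._field:
            # Collapse the surrounding whitespace and newlines.
            self.items[-1][self._field] = " ".join("".join(self._chunks).split())
            self._field = None


# Shortened copy of the response shown above (assumption: closing </li> added).
fragment = """
<li id="PH969SH70HPTINDFAS" class="itm hasOverlay unit size1of4 ">
  <span class="itm-budgeFlag offInside"><span class="flagBrdLeft"></span>OFFER INSIDE</span>
  <span class="itm-Catbrand strong">Phosphorus</span>
  <span class="itm-title">
    Black Moccasins </span>
</li>
"""

parser = ItemFieldParser()
parser.feed(fragment)
print(parser.items)
```

In a real spider you would more likely loop over the li elements with a selector and run relative XPath queries on each one; the point here is only that each field is collected per item rather than as three unrelated lists.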
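So, to get the complete list of items, you have to return Request objects from your Spider's parse method (see the Spider class documentation) to tell Scrapy that it should load more data: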
import re
from scrapy.http import Request

def parse(self, response):
    # ... Extract items in the page using extractors

    # n is the number of the next "page" to parse.
    # You can get n by using response.url, extracting the number
    # at the end and adding 1 (the first page has no ?page= parameter).
    match = re.search(r"[?&]page=(\d+)", response.url)
    n = int(match.group(1)) + 1 if match else 2
    # It is VERY IMPORTANT to set the Referer and X-Requested-With headers
    # here because that's how the website detects whether the request was made
    # by JavaScript or directly by following a link.
    req = Request(url="http://www.website-your-are-crawling.com/men/shoes/?page=" + str(n),
                  headers={"Referer": "http://www.website-your-are-crawling.com/men/shoes/",
                           "X-Requested-With": "XMLHttpRequest"})
    return req  # and your items
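If you would rather know up front how many pages to request, a rough estimate can be derived from the numbers in the question. This is only a sketch: it assumes every page returns 52 items like the first one did, which the question does not confirm, and the names page_urls, TOTAL_ITEMS and ITEMS_PER_PAGE are made up for the example.

```python
import math

BASE_URL = "http://www.website-your-are-crawling.com/men/shoes/"
TOTAL_ITEMS = 3824    # figure quoted in the question
ITEMS_PER_PAGE = 52   # ASSUMPTION: every page returns as many items as the first


def page_urls(total_items=TOTAL_ITEMS, per_page=ITEMS_PER_PAGE):
    """Return the URLs of pages 2..last; page 1 is the start URL itself."""
    last_page = math.ceil(total_items / per_page)
    return [f"{BASE_URL}?page={n}" for n in range(2, last_page + 1)]


urls = page_urls()
print(len(urls), urls[0], urls[-1])
```

If the page size turns out to differ, a safer stopping rule is to stop following pages as soon as a response contains no items.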
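Oh, and by the way (in case you want to test), you can't just load http://www.website-your-are-crawling.com/men/shoes/?page=2 in your browser to see what it returns, because the website redirects you to the global page (i.e. http://www.website-your-are-crawling.com/men/shoes/) if the X-Requested-With header is anything other than XMLHttpRequest.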
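To test outside the browser, you can build the same request yourself with the two headers set. Below is a minimal sketch using only the standard library; build_ajax_request is a made-up helper name, and the request is constructed but not sent, so the snippet runs without network access.

```python
import urllib.request

BASE_URL = "http://www.website-your-are-crawling.com/men/shoes/"


def build_ajax_request(page):
    """Build (but do not send) a request that mimics the site's JavaScript call."""
    return urllib.request.Request(
        f"{BASE_URL}?page={page}",
        headers={"Referer": BASE_URL, "X-Requested-With": "XMLHttpRequest"},
    )


req = build_ajax_request(2)
# urllib stores header names via str.capitalize(), hence the lowercase "r" and "w".
print(req.full_url, req.get_header("X-requested-with"))
# To actually send it: html = urllib.request.urlopen(req).read()
```

Comparing the body you get with and without the X-Requested-With header is a quick way to check whether the site behaves as described.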