Scraping within a url using scrapy
I am trying to scrape craigslist using scrapy and have successfully fetched the listing URLs, but now I want to extract the data from the page behind each URL. Here is the code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist.items import CraigslistItem

class craigslist_spider(BaseSpider):
    name = "craigslist_unique"
    allowed_domains = ["craiglist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time",
        "http://newyork.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addThree=internship",
        "http://seattle.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time"
    ]

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//span[@class='pl']")
    items = []
    for site in sites:
        item = CraigslistItem()
        item['title'] = site.select('a/text()').extract()
        item['link'] = site.select('a/@href').extract()
        #item['desc'] = site.select('text()').extract()
        items.append(item)
    hxs = HtmlXPathSelector(response)
    #print title, link
    return items
I am new to this and cannot figure out how to actually visit each URL (href), fetch the data from that URL's page, and do this for all of the URLs.
The responses for the start_urls are received one by one in the parse method.

If you just want to fetch information from those start_urls responses, your code is almost fine. But your parse method should be inside the craigslist_spider class, not outside it.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//span[@class='pl']")
    items = []
    for site in sites:
        item = CraigslistItem()
        item['title'] = site.select('a/text()').extract()
        item['link'] = site.select('a/@href').extract()
        items.append(item)
    #print title, link
    return items
What if you want half of the information from the start_urls responses, and the other half from the pages behind the anchors found in those responses?
from urlparse import urljoin      # urllib.parse in Python 3
from scrapy.http import Request

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//span[@class='pl']")
    for site in sites:
        item = CraigslistItem()
        item['title'] = site.select('a/text()').extract()
        links = site.select('a/@href').extract()  # extract() returns a list
        if links:
            link = links[0]
            if 'http://' not in link:
                # Craigslist hrefs are site-relative, so make them absolute
                link = urljoin(response.url, link)
            item['link'] = link
            yield Request(item['link'],
                          meta={'item': item},
                          callback=self.anchor_page)

def anchor_page(self, response):
    hxs = HtmlXPathSelector(response)
    old_item = response.request.meta['item']  # the item built in parse, shipped via Request meta
    # parse some more values and place them in old_item, e.g.
    old_item['bla_bla'] = hxs.select("bla bla").extract()
    yield old_item
You just need to yield a Request in the parse method and ship your old item in the Request's meta; then, in anchor_page, pull the old_item back out of meta, add the new values to it, and simply yield it.
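The urljoin step above is what turns Craigslist's site-relative hrefs into URLs that Scrapy can actually request. A quick standalone illustration of its behavior (shown with Python 3's urllib.parse; Python 2's urlparse.urljoin behaves the same):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://sfbay.craigslist.org/search/sof"

# a relative href is resolved against the response URL
absolute = urljoin(base, "/sby/sof/3824966457.html")
print(absolute)      # http://sfbay.craigslist.org/sby/sof/3824966457.html

# an already-absolute href passes through unchanged
passthrough = urljoin(base, "http://newyork.craigslist.org/about")
print(passthrough)   # http://newyork.craigslist.org/about
```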
There is one problem in your xpaths: they should be relative. Here is the code:
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class CraigslistItem(Item):
    title = Field()
    link = Field()

class CraigslistSpider(BaseSpider):
    name = "craigslist_unique"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time",
        "http://newyork.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addThree=internship",
        "http://seattle.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//span[@class='pl']")
        items = []
        for site in sites:
            item = CraigslistItem()
            item['title'] = site.select('.//a/text()').extract()[0]
            item['link'] = site.select('.//a/@href').extract()[0]
            items.append(item)
        return items
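The difference a relative path makes can be seen outside Scrapy too. This sketch uses the standard library's ElementTree on hypothetical markup mimicking the Craigslist rows; the same idea applies to HtmlXPathSelector, where an absolute '//a' inside the loop would match every anchor on the page instead of just the current row's:

```python
import xml.etree.ElementTree as ET

# a small stand-in for the Craigslist listing markup (hypothetical rows)
html = (
    "<div>"
    "<span class='pl'><a href='/sby/sof/1.html'>HR Admin</a></span>"
    "<span class='pl'><a href='/eby/sof/2.html'>Web Developer</a></span>"
    "</div>"
)
root = ET.fromstring(html)
sites = root.findall(".//span[@class='pl']")

# './/a' is relative: it searches only inside the current <span>,
# so each row yields exactly its own title and link
items = [{"title": s.find(".//a").text, "link": s.find(".//a").get("href")}
         for s in sites]
print(items)
```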
If you run it with:

scrapy runspider spider.py -o output.json

you will see in output.json:
{"link": "/sby/sof/3824966457.html", "title": "HR Admin/Tech Recruiter"}
{"link": "/eby/sof/3824932209.html", "title": "Entry Level Web Developer"}
{"link": "/sfc/sof/3824500262.html", "title": "Sr. Ruby on Rails Contractor @ Funded Startup"}
...
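Since the exported link values are still site-relative, as the output above shows, a short post-processing pass can absolutize them. This sketch assumes one JSON object per line, as printed above; the base URL is an assumption, since the exported items no longer record which city page they came from:

```python
import json
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# lines copied from the output above
output_lines = [
    '{"link": "/sby/sof/3824966457.html", "title": "HR Admin/Tech Recruiter"}',
    '{"link": "/eby/sof/3824932209.html", "title": "Entry Level Web Developer"}',
]
base = "http://sfbay.craigslist.org/"  # assumed base URL

items = []
for line in output_lines:
    item = json.loads(line)
    item["link"] = urljoin(base, item["link"])  # make the stored link absolute
    items.append(item)

print(items[0]["link"])  # http://sfbay.craigslist.org/sby/sof/3824966457.html
```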
Hope that helps.