Scrapy SgmlLinkExtractor - Having trouble with recursively scraping
Update: apparently I can't answer my own question for 8 hours, but I solved it. Thanks, everyone!
I'm having trouble crawling the links on my start_url.
Here is my code:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from dirbot.items import Website

class mydomainSpider(CrawlSpider):
    name = "mydomain"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://www.mydomain.com/cp/133162",]
    """133162 category to crawl"""

    rules = (
        Rule(SgmlLinkExtractor(allow=('133162', ), deny=('/ip/', ))),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []
        for site in sites:
            item = Website()
            item['description'] = site.select('//meta[@name="Description"]/@content').extract()
            item['url'] = response.url
            item['title'] = site.xpath('/html/head/title/text()').extract()
            items.append(item)
        return items
I am new to Python, so any advice is welcome. Thanks for your time!
I figured it out — thanks, everyone!
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from wallspider.items import Website

class mydomainSpider(CrawlSpider):
    name = "mydomain"
    allowed_domains = ["www.mydomain"]
    start_urls = ["http://www.mydomain/cp/133162",]

    rules = (
        Rule(SgmlLinkExtractor(allow=('133162', ),
                               deny=('/ip/', 'search_sort=', 'ic=60_0', 'customer_rating', 'special_offers', )),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*')
        items = []
        for site in sites:
            item = Website()
            item['referer'] = response.request.headers.get('Referer')
            item['url'] = response.url
            item['title'] = site.xpath('/html/head/title/text()').extract()
            item['description'] = site.select('//meta[@name="Description"]/@content').extract()
            items.append(item)
        return items
Some observations:

- Use yield to emit each item instead of accumulating them in a list. The spider seems to crawl the URLs as you expect; the problem is in how you parse the pages.
- extract() returns a list, so if the expected type of item['description'] and item['title'] is not list, I think you will have some trouble storing those items.
- The statement sites = hxs.select('//html') does not seem necessary, and it may cause duplicate data.