scrapy.Request does not call my callback function
I'm sorry if my question is too trivial, but I have been stuck on this since this morning. I'm new to Scrapy and I have already read the docs, but I haven't found my answer.
I wrote this spider, and when I use parse_body as the callback in rules = (Rule(LinkExtractor(), callback='parse_body'),), it runs:
tchatch = response.xpath('//div[@class="ProductPriceBox-item detail"]/div/a/@href').extract()
print('\n TROUVE \n')
print(tchatch)
print('\n DONE \n')
But when I rename the function parse_body to just parse everywhere in my code, it only runs:
print('\n EN FAIT, ICI : ', response.url, '\n')
It seems that the callbacks of my scrapy.Request requests are never called. I even added a lot of otherwise useless print calls to check whether my code was running those functions, but nothing is printed except the print shown above.
Any ideas, please?
# -*- coding: utf-8 -*-
import scrapy
import re
import numbers
from fnac.items import FnacItem
from urllib.request import urlopen
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup

class Fnac(CrawlSpider):
    name = 'FnacCom'
    allowed_domains = ['fnac.com']
    start_urls = ['http://musique.fnac.com/a10484807/The-Cranberries-Something-else-CD-album']
    rules = (
        Rule(LinkExtractor(), callback='parse_body'),
    )

    def parse_body(self, response):
        item = FnacItem()
        nb_sales = response.xpath('//body//table[@summary="données détaillée du vendeur"]/tbody/tr/td/span/text()').re(r'([\d]*) ventes')
        country = response.xpath('//body//table[@summary="données détaillée du vendeur"]/tbody/tr/td/text()').re(r'([A-Z].*)')
        item['nb_sales'] = ''.join(nb_sales).strip()
        item['country'] = ''.join(country).strip()
        print(response.url)
        test_list = response.xpath('//a/@href')
        for test_list in response.xpath('.//div[@class="ProductPriceBox-item detail"]'):
            tchatch = response.xpath('//div[@class="ProductPriceBox-item detail"]/div/a/@href').extract()
            print('\n TROUVE \n')
            print(tchatch)
            print('\n DONE \n')
        yield scrapy.Request(response.url, callback=self.parse_iframe, meta={'item': item})

    def parse_iframe(self, response):
        f_item1 = response.meta['item']
        print('\n EN FAIT, ICI : ', response.url, '\n')
        soup = BeautifulSoup(urlopen(response.url), "lxml")
        iframexx = soup.find_all('iframe')
        if (len(iframexx) != 0):
            for iframe in iframexx:
                yield scrapy.Request(iframe.attrs['src'], callback=self.extract_or_loop, meta={'item': f_item1})
        else:
            yield scrapy.Request(response.url, callback=self.extract_or_loop, meta={'item': f_item1})

    def extract_or_loop(self, response):
        f_item2 = response.meta['item']
        print('\n PEUT ETRE ICI ? \n')
        address = response.xpath('//body//div/p/text()').re(r'.*Adresse \: (.*)\n?.*')
        email = response.xpath('//body//div/ul/li[contains(text(),"@")]/text()').extract()
        name = response.xpath('//body//div/p[@class="customer-policy-label"]/text()').re(r'Infos sur la boutique \: ([a-zA-Z0-9]*\s*)')
        phone = response.xpath('//body//div/p/text()').re(r'.*Tél \: ([\d]*)\n?.*')
        siret = response.xpath('//body//div/p/text()').re(r'.*Siret \: ([\d]*)\n?.*')
        vat = response.xpath('//body//div/text()').re(r'.*TVA \: (.*)')
        if (len(name) != 0):
            print('\n', name, '\n')
            f_item2['name'] = ''.join(name).strip()
            f_item2['address'] = ''.join(address).strip()
            f_item2['phone'] = ''.join(phone).strip()
            f_item2['email'] = ''.join(email).strip()
            f_item2['vat'] = ''.join(vat).strip()
            f_item2['siret'] = ''.join(siret).strip()
            yield f_item2
        else:
            for sel in response.xpath('//html/body'):
                list_urls = sel.xpath('//a/@href').extract()
                list_iframe = response.xpath('//div[@class="ProductPriceBox-item detail"]/div/a/@href').extract()
                if (len(list_iframe) != 0):
                    for list_iframe in list_urls:
                        print('\n', list_iframe, '\n')
                        print('\n GROS TCHATCH \n')
                        yield scrapy.Request(list_iframe, callback=self.parse_body)
                for url in list_urls:
                    yield scrapy.Request(response.urljoin(url), callback=self.parse_body)
In the Scrapy documentation for CrawlSpider, there is a warning:
Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
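To make the warning concrete, here is a toy model in plain Python. It is not Scrapy's actual code: ToyCrawlSpider, WorkingSpider, BrokenSpider, run and the dictionary-based "page" are all hypothetical names made up for this sketch. It only assumes what the warning states, namely that the rule-following logic lives in parse, so a subclass that defines its own parse replaces that logic.

```python
class ToyCrawlSpider:
    """Toy stand-in for a crawl spider (hypothetical, not Scrapy's real class).

    The framework hands each downloaded page to parse(), and parse()
    is where the rule-following logic lives.
    """

    # Each rule is a (link_filter, callback_name) pair.
    rules = ()

    def parse(self, page):
        # Built-in logic: send every extracted link to the matching rule callback.
        for link in page["links"]:
            for link_filter, callback_name in self.rules:
                if link_filter(link):
                    yield from getattr(self, callback_name)(link)


class WorkingSpider(ToyCrawlSpider):
    # The callback has its own name, so the inherited parse() stays intact.
    rules = ((lambda link: True, "parse_body"),)

    def parse_body(self, link):
        yield "parse_body got " + link


class BrokenSpider(ToyCrawlSpider):
    rules = ((lambda link: True, "parse"),)

    def parse(self, page):
        # Overriding parse() replaces the dispatch loop, so the rules never run.
        yield "parse got the start page only"


def run(spider, page):
    """Mimic the framework: pass the start page to parse() and collect the results."""
    return list(spider.parse(page))


start_page = {"links": ["/a", "/b"]}
working_output = run(WorkingSpider(), start_page)
broken_output = run(BrokenSpider(), start_page)
print(working_output)
print(broken_output)
```

In this model the working spider's callback runs once per extracted link, while the broken spider only ever processes the start page. For the spider in the question, the practical fix is to keep a callback name other than parse, such as parse_body, in the Rule.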