How to pass a variable to Scrapy Spider
I have a list in my spider class that I need to initialize. This is what the code looks like:
import pickle
from datetime import datetime

from scrapy.spiders import SitemapSpider

class Myspider(SitemapSpider):
    name = 'spidername'
    sitemap_urls = [
        'https://www.arabam.com/sitemap/otomobil_13.xml']
    sitemap_rules = [
        ('/otomobil/', 'parse'),
    ]
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': "arabam_" + str(datetime.today().strftime('%d%m%y')) + '.csv',
    }
    crawled = []
    new_links = 0

    def parse(self, response):
        if self.new_links > 3:
            with open("URLs", "wb") as f:
                pickle.dump(self.crawled, f)
            self.new_links = 0
        for td in response.xpath("/html/body/div[3]/div[6]/div[4]/div/div[2]/table/tbody/tr/td[4]/div/a"):
            link = td.xpath("@href").extract()  # href extraction implied by the snippet
            if link[0] not in self.crawled:
                self.crawled.append(link[0])
        # ... some code
Traceback (most recent call last):
File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
result = f(*args, **kw)
File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\scrapy\extensions\feedexport.py", line 262, in item_scraped
slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
It keeps throwing the following exception:
Traceback (most recent call last):
File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
result = f(*args, **kw)
File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\scrapy\extensions\feedexport.py", line 232, in open_spider
uri = self.urifmt % self._get_uri_params(spider)
File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\scrapy\extensions\feedexport.py", line 313, in _get_uri_params
params[k] = getattr(spider, k)
File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\scrapy\spiders\__init__.py", line 36, in logger
logger = logging.getLogger(self.name)
File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\logging\__init__.py", line 1845, in getLogger
return Logger.manager.getLogger(name)
File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\logging\__init__.py", line 1174, in getLogger
raise TypeError('A logger name must be a string')
TypeError: A logger name must be a string
According to some sources, the cause is the same TypeError shown above ('A logger name must be a string').
How can I pass the list to the spider, or is there some way to have the Scrapy spider initialize it only once? The list contains all the URLs crawled so far and is pickled. When the code starts, it should initialize this list and only crawl further if a link is not already in it.
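The load-once / persist pattern described above can be sketched with just the standard library; the file name matches the question, while the URL is purely illustrative:

```python
import os
import pickle

URLS_FILE = "URLs"  # same file name used in the question

def load_crawled():
    # Return the previously pickled URL list, or an empty list on the first run.
    if not os.path.exists(URLS_FILE):
        return []
    with open(URLS_FILE, "rb") as f:
        try:
            return pickle.load(f)
        except EOFError:  # file exists but is empty
            return []

def save_crawled(crawled):
    # Persist the full list; it is reloaded on the next run.
    with open(URLS_FILE, "wb") as f:
        pickle.dump(crawled, f)

crawled = load_crawled()
if "https://www.arabam.com/otomobil/example-ad" not in crawled:
    crawled.append("https://www.arabam.com/otomobil/example-ad")
save_crawled(crawled)
```

Keeping the list as a real Python list (rather than pickling a string) matters: membership tests with `in` then behave as intended.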
In your case, you need to pass the URL list using the spider attribute name (crawled).
According to the documentation, if you do not override the spider's __init__ method, all arguments passed to the spider class are mapped to spider attributes. So, in order to override the crawled attribute, you need to send exactly that argument name.
Something like this:
import pickle

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
crawled_urls = []
try:
    with open("URLs", "rb") as openfile:
        while True:
            try:
                crawled_urls = pickle.load(openfile)
            except EOFError:
                break
except OSError:
    # First run: create the file with an empty list
    with open("URLs", "wb") as f:
        pickle.dump([], f)

print(crawled_urls)
process.crawl(Myspider, crawled=crawled_urls)
process.start()  # the script will block here until the crawling is finished
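The attribute mapping this answer relies on can be illustrated without Scrapy at all: the base Spider's __init__ essentially copies keyword arguments onto the instance. SpiderLike below is a stand-in for demonstration, not a Scrapy class:

```python
class SpiderLike:
    # Mimics scrapy.Spider.__init__, which copies keyword
    # arguments onto the instance (roughly self.__dict__.update(kwargs)).
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

# Passing crawled=... sets the instance attribute, shadowing
# any class-level default of the same name.
spider = SpiderLike(crawled=["https://www.arabam.com/otomobil/seen-ad"])
```

This is why `process.crawl(Myspider, crawled=crawled_urls)` replaces the class-level `crawled = []` for that spider instance.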