How to pass a variable to Scrapy Spider

I have a list in my spider class that I need to initialize. This is what the code looks like:

import pickle
from datetime import datetime

from scrapy.spiders import SitemapSpider


class Myspider(SitemapSpider):
    name = 'spidername'

    sitemap_urls = ['https://www.arabam.com/sitemap/otomobil_13.xml']
    sitemap_rules = [
        ('/otomobil/', 'parse'),
    ]
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'arabam_' + datetime.today().strftime('%d%m%y') + '.csv',
    }
    crawled = []   # URLs that have already been crawled
    new_links = 0

    def parse(self, response):
        # Persist the crawled list once more than 3 new links have accumulated.
        if self.new_links > 3:
            with open("URLs", "wb") as f:
                pickle.dump(self.crawled, f)
            self.new_links = 0
        for td in response.xpath("/html/body/div[3]/div[6]/div[4]/div/div[2]/table/tbody/tr/td[4]/div/a"):
            # Assumption: the original does not show how `link` is built;
            # here it is taken to be the anchor's href values.
            link = td.xpath("@href").extract()
            if link and link[0] not in self.crawled:
                self.crawled.append(link[0])
                # ... some code ...

It keeps throwing the following exceptions:

Traceback (most recent call last):
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\scrapy\extensions\feedexport.py", line 262, in item_scraped
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'

Traceback (most recent call last):
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\scrapy\extensions\feedexport.py", line 232, in open_spider
    uri = self.urifmt % self._get_uri_params(spider)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\scrapy\extensions\feedexport.py", line 313, in _get_uri_params
    params[k] = getattr(spider, k)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\scrapy\spiders\__init__.py", line 36, in logger
    logger = logging.getLogger(self.name)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\logging\__init__.py", line 1845, in getLogger
    return Logger.manager.getLogger(name)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\logging\__init__.py", line 1174, in getLogger
    raise TypeError('A logger name must be a string')
TypeError: A logger name must be a string

According to some resources, the cause is related to the traceback above.

How do I pass the list to the spider, or is there a way to initialize this list only once with a Scrapy spider? The list contains all the URLs that have been crawled, and it is pickled. When the code starts, it initializes this list and crawls further only if the link is not already in it.

You need to pass the list of URLs using the spider attribute name, which is crawled in your case.

According to the docs, if you don't override the __init__ method of the spider, all the keyword arguments passed to the spider class are mapped to spider attributes. So, in order to override the crawled attribute, you need to send the exact argument name.
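To make that mapping concrete, here is a minimal sketch that does not import Scrapy. SketchSpider is a hypothetical stand-in that only mimics the argument handling of the base spider (first positional argument is the name, keyword arguments become attributes), and the example URL is made up. The question does not show how process.crawl was called, so the positional call below is an assumption about what may have produced the "A logger name must be a string" error.

```python
import logging


class SketchSpider:
    """Hypothetical stand-in mimicking how the base spider handles its arguments."""

    name = "spidername"
    crawled = []  # class-level default, as in the question

    def __init__(self, name=None, **kwargs):
        # The first positional argument is treated as the spider name.
        if name is not None:
            self.name = name
        # Every keyword argument becomes an instance attribute.
        self.__dict__.update(kwargs)

    @property
    def logger(self):
        # logging.getLogger rejects a non-string name such as a list.
        return logging.getLogger(self.name)


urls = ["https://example.com/a"]  # made-up URL for illustration

# Keyword argument: overrides `crawled`, the name is left untouched.
good = SketchSpider(crawled=urls)

# Positional argument: the list silently replaces the spider name.
bad = SketchSpider(urls)
try:
    bad.logger
    logger_error = None
except TypeError as exc:
    logger_error = str(exc)
```

If this matches what happened in your script, passing the list as crawled=... rather than as a bare positional argument should leave the spider name intact.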

Something like this:

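import pickle

from scrapy.crawler import CrawlerProcess

# Myspider is the spider class from the question, defined or imported above.
process = CrawlerProcess()
crawled_urls = []

try:
    with open("URLs", "rb") as openfile:
        # The file may hold several pickled objects; keep the last one.
        while True:
            try:
                crawled_urls = pickle.load(openfile)
            except EOFError:
                break
except IOError:
    # First run: no "URLs" file yet, so create one holding an empty list.
    with open("URLs", "wb") as f:
        pickle.dump([], f)

print(crawled_urls)
# Pass the list by keyword so it overrides the spider's `crawled` attribute.
process.crawl(Myspider, crawled=crawled_urls)
process.start()  # blocks until the crawl finishes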
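
The load-or-create step could also be factored into two small helpers, which might make it easier to reuse from both the launcher script and the spider. This is only a sketch: load_crawled and save_crawled are hypothetical names, not part of Scrapy, and the demo writes to a temporary directory instead of the real "URLs" file.

```python
import os
import pickle
import tempfile


def load_crawled(path):
    """Return the pickled list of URLs, or an empty list if the file is missing or empty."""
    try:
        with open(path, "rb") as f:
            return list(pickle.load(f))
    except (IOError, EOFError):
        return []


def save_crawled(path, urls):
    """Overwrite the file with the current list of URLs."""
    with open(path, "wb") as f:
        pickle.dump(list(urls), f)


# Round trip in a temporary directory so the real "URLs" file is untouched.
demo_path = os.path.join(tempfile.mkdtemp(), "URLs")
first_run = load_crawled(demo_path)  # no file exists yet
save_crawled(demo_path, first_run + ["https://example.com/a"])
second_run = load_crawled(demo_path)
```

With helpers like these, the launcher would reduce to something like process.crawl(Myspider, crawled=load_crawled("URLs")).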
