
How to pass a variable to Scrapy Spider

I have a list in my spider class that I need to initialize. This is what the code looks like:

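import pickle
from datetime import datetime

from scrapy.spiders import SitemapSpider


class Myspider(SitemapSpider):
    name = 'spidername'

    sitemap_urls = ['https://www.arabam.com/sitemap/otomobil_13.xml']
    sitemap_rules = [
        ('/otomobil/', 'parse'),
    ]
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'arabam_' + datetime.today().strftime('%d%m%y') + '.csv',
    }
    crawled = []   # URLs that have already been crawled
    new_links = 0  # links found since the list was last saved

    def parse(self, response):
        # Periodically persist the crawled list to disk.
        if self.new_links > 3:
            with open("URLs", "wb") as f:
                pickle.dump(self.crawled, f)
            self.new_links = 0
        for td in response.xpath("/html/body/div[3]/div[6]/div[4]/div/div[2]/table/tbody/tr/td[4]/div/a"):
            # `link` was not defined in the original snippet; reading the
            # anchor's href here is an assumption.
            link = td.xpath("@href").getall()
            if link and link[0] not in self.crawled:
                self.crawled.append(link[0])
                # ... some code ...

Traceback (most recent call last):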
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\scrapy\extensions\feedexport.py", line 262, in item_scraped
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'

It keeps throwing the following exception:

Traceback (most recent call last):
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\scrapy\extensions\feedexport.py", line 232, in open_spider
    uri = self.urifmt % self._get_uri_params(spider)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\scrapy\extensions\feedexport.py", line 313, in _get_uri_params
    params[k] = getattr(spider, k)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\site-packages\scrapy\spiders\__init__.py", line 36, in logger
    logger = logging.getLogger(self.name)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\logging\__init__.py", line 1845, in getLogger
    return Logger.manager.getLogger(name)
  File "C:\Users\fatima.arshad\AppData\Local\Continuum\anaconda2\envs\web_scraping\lib\logging\__init__.py", line 1174, in getLogger
    raise TypeError('A logger name must be a string')
TypeError: A logger name must be a string

According to some resources, this is because the spider's `name`, which the traceback shows being passed to `logging.getLogger`, is not a string.

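How can I pass a list to the spider, or is there a way to initialize this list only once with a Scrapy spider? The list contains all the URLs that have already been crawled, and it is pickled. When the code starts, it should load this list and crawl a link only if that link is not already in the list.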

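In your case, you need to pass the list of URLs using the name of the spider attribute, which is `crawled`.

According to the docs, if you don't override the spider's `__init__` method, all keyword arguments passed to the spider class are mapped to spider attributes. So, to override the `crawled` attribute, the argument must be passed under exactly that name.

Something like this: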

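import pickle

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
crawled_urls = []

try:
    with open("URLs", "rb") as openfile:
        # Read until end of file, keeping the last object stored.
        while True:
            try:
                crawled_urls = pickle.load(openfile)
            except EOFError:
                break
except FileNotFoundError:
    # First run: create the file with an empty list.
    with open("URLs", "wb") as f:
        pickle.dump([], f)

print(crawled_urls)
# Pass the list as a keyword argument so it overrides the `crawled` attribute.
process.crawl(Myspider, crawled=crawled_urls)
process.start()  # blocks until the crawl finishes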
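
To see why the argument name matters, here is a minimal sketch that runs without Scrapy. `SketchSpider` is a hypothetical, simplified stand-in that only mimics the behaviour described above (a first positional argument is taken as the spider's name, and keyword arguments become attributes); it is not Scrapy's actual code, and the URL is made up for illustration.

```python
import logging


class SketchSpider:
    """Simplified stand-in for a spider (hypothetical, not Scrapy's code)."""

    name = "spidername"
    crawled = []

    def __init__(self, name=None, **kwargs):
        # A first positional argument is treated as the spider's name.
        if name is not None:
            self.name = name
        # Every keyword argument becomes an instance attribute.
        self.__dict__.update(kwargs)

    @property
    def logger(self):
        # Same call as the one shown in the traceback.
        return logging.getLogger(self.name)


urls = ["https://www.arabam.com/otomobil/example"]  # hypothetical URL

# Keyword argument: overrides the class-level `crawled`, `name` is untouched.
good = SketchSpider(crawled=urls)
print(good.crawled, good.name, good.logger.name)

# Positional argument: the list ends up in `name`, so the logger lookup fails.
bad = SketchSpider(urls)
try:
    bad.logger
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```

If the list were handed over positionally, it would replace `name` and trigger the same kind of `TypeError` as in the question, whereas `crawled=...` leaves `name` alone and only replaces the list.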

