Scrapy Request 'IndentationError: unexpected indent' on parse callback
I'm using the Scrapy CLI on an Ubuntu 18 server. To avoid hard-coding a long list of URLs in the `start_urls` attribute, I call `yield scrapy.Request()` at the bottom of my parse method instead. The site I'm scraping is fairly basic and has a separate page for each year from 2014 to 2030. At the bottom of my code I have an `if` check that looks at the current year and moves the scraper on to the next year's page. I'm new to Scrapy in general, so I'm not sure I'm calling the `scrapy.Request()` method correctly. Here is my code:
```python
import scrapy
from .. import items


class EventSpider(scrapy.Spider):
    name = "event_spider"
    start_urls = [
        "http://www.seasky.org/astronomy/astronomy-calendar-2014.html",
    ]
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
    start_year = 2014

    # response is the website
    def parse(self, response):
        CONTENT_SELECTOR = 'div#right-column-content ul li'
        for astro_event in response.css(CONTENT_SELECTOR):
            NAME_SELECTOR = "p span.title-text ::text"
            DATE_SELECTOR = "p span.date-text ::text"
            DESCRIPTION_SELECTOR = "p ::text"
            item = items.AstroEventsItem()
            item["title"] = astro_event.css(NAME_SELECTOR).extract_first()
            item["date"] = astro_event.css(DATE_SELECTOR).extract_first()
            item["description"] = astro_event.css(DESCRIPTION_SELECTOR)[-1].extract()
            yield item

        # Next page code:
        # Goes through years 2014 to 2030
        if self.start_year < 2030:
            self.start_year = self.start_year + 1
            new_url = "http://www.seasky.org/astronomy/astronomy-calendar-" + str(self.start_year) + ".html"
            print(new_url)
            yield scrapy.Request(new_url, callback=self.parse)
```
Here is the error I get after the first page is scraped successfully:
```
2020-11-10 05:25:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.seasky.org/astronomy/astronomy-calendar-2015.html> (referer: http://www.seasky.org/astronomy/astronomy-calendar-2014.html)
2020-11-10 05:25:50 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.seasky.org/astronomy/astronomy-calendar-2015.html> (referer: http://www.seasky.org/astronomy/astronomy-calendar-2014.html)
Traceback (most recent call last):
  File "/home/jcmq6b/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: <200 http://www.seasky.org/astronomy/astronomy-calendar-2015.html>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
    result = f(*args, **kw)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 58, in process_spider_input
    return scrape_func(response, request, spider)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/core/scraper.py", line 149, in call_spider
    warn_on_generator_with_return_value(spider, callback)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/misc.py", line 245, in warn_on_generator_with_return_value
    if is_generator_with_return_value(callable):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/misc.py", line 230, in is_generator_with_return_value
    tree = ast.parse(dedent(inspect.getsource(callable)))
  File "/usr/lib/python3.6/ast.py", line 35, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
    def parse(self, response):
    ^
IndentationError: unexpected indent
```
I think I may not be passing the right arguments to the `parse` callback, but I'm not sure. Any help is greatly appreciated! Let me know if I need to post more information.
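One observation from the traceback: the error is not raised while the spider itself runs, but inside Scrapy's `warn_on_generator_with_return_value` check, which re-reads the callback's source with `inspect.getsource`, strips the leading indentation with `textwrap.dedent`, and re-parses it with `ast.parse`. A plausible trigger (an assumption, not confirmed by the question) is mixed tabs and spaces in the spider file: `dedent` only removes a *common* literal prefix, so a tab on one line and spaces on another leave the source indented and the re-parse fails with exactly this `IndentationError`. A minimal sketch of that failure mode, where `src` is a hypothetical stand-in for what `inspect.getsource` might return:

```python
import ast
import textwrap

# Method source as inspect.getsource() might return it for a spider
# callback: the first line indented with spaces, the second with a tab.
src = "    def parse(self, response):\n\tyield {}\n"

# textwrap.dedent() strips only a *common* literal prefix; a space and a
# tab share no common prefix, so the text comes back unchanged.
dedented = textwrap.dedent(src)
print(dedented == src)  # True: dedent left the indentation in place

# Re-parsing indented source then fails the same way as in the traceback.
try:
    ast.parse(dedented)
except IndentationError as e:
    print("IndentationError:", e)
```

If that is the cause, re-indenting the whole spider file consistently with spaces makes the check pass.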
For anyone who stumbles on this: I never found the cause of the indentation error, but I did find a workaround by splitting the code into two different parse methods:
```python
import scrapy
from .. import items


class EventSpider(scrapy.Spider):
    name = "event_spider"
    start_urls = ["http://www.seasky.org/astronomy/astronomy-calendar-2014.html"]
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
    start_year = 2014

    # Next page code:
    def parse(self, response):
        # Goes through years 2014 to 2030 from the href links at top of page
        for href in response.css("div#top-links div h3 a::attr(href)"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_contents)

    # parses items for database
    def parse_contents(self, response):
        CONTENT_SELECTOR = 'div#right-column-content ul li'
        for astro_event in response.css(CONTENT_SELECTOR):
            NAME_SELECTOR = "p span.title-text ::text"
            DATE_SELECTOR = "p span.date-text ::text"
            DESCRIPTION_SELECTOR = "p ::text"
            item = items.AstroEventsItem()
            item["title"] = astro_event.css(NAME_SELECTOR).extract_first()
            item["date"] = astro_event.css(DATE_SELECTOR).extract_first()
            item["description"] = astro_event.css(DESCRIPTION_SELECTOR)[-1].extract()
            yield item
```
The first parse grabs the URLs from the hrefs listed on the site, then calls the second parse method, `parse_contents`, for each one; `parse_contents` processes the information scraped from each page into items for MongoDB. Hopefully this helps anyone who runs into a similar problem.
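If the root cause was mixed indentation in the original single-method spider, a quick scan of the file can confirm it. Below is a hypothetical helper (not part of Scrapy; the function name is my own) that reports line numbers whose indentation contains a tab:

```python
def find_tab_indented_lines(path):
    """Return line numbers in *path* whose indentation contains a tab.

    Mixed tabs and spaces are one known way to trip the IndentationError
    raised by Scrapy's generator-return-value check, because
    textwrap.dedent() cannot strip a margin that mixes the two.
    """
    bad_lines = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            # Leading whitespace = everything before the first non-space char.
            indent = line[: len(line) - len(line.lstrip())]
            if "\t" in indent:
                bad_lines.append(lineno)
    return bad_lines
```

Running it over the spider module, e.g. `find_tab_indented_lines("spiders/event_spider.py")`, and converting any reported lines to spaces may let the original single-`parse` version run unchanged.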
Note: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. Questions: yoyou2525@163.com.