Scraped data printing to terminal, but not saving in CSV file
I am working on a scrapy project to scrape video game product information and reviews from Metacritic. The data I want lives on different pages, and I want to scrape the product info into one CSV and the reviews into another CSV. My code is therefore more complicated than "scrape data, yield item": I need to yield one kind of item (product info), then issue a request to go to the game's review page and yield another kind of item (product reviews).
My current code runs, but the scraped data is printed to the anaconda prompt terminal window while the CSV files remain empty. All of the data is being scraped correctly, though, since I can see it in the terminal. The problem seems to be in how the items are yielded and handled in pipelines.py.
Below is the code for items.py, myspider.py, and pipelines.py. The spider code has been heavily edited down to just the relevant parts, as it is quite long and complex.
items.py:
import scrapy

class GameItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    platform = scrapy.Field()
    genres = scrapy.Field()
    release_date = scrapy.Field()
    ESRB_rating = scrapy.Field()
    summary = scrapy.Field()
    average_user_score = scrapy.Field()
    metascore = scrapy.Field()
    developer = scrapy.Field()
    publisher = scrapy.Field()

class ReviewItem(scrapy.Item):
    title = scrapy.Field()
    platform = scrapy.Field()
    username = scrapy.Field()
    score = scrapy.Field()
    date = scrapy.Field()
    review_text = scrapy.Field()
    critic_flag = scrapy.Field()
game_spider.py:
from scrapy import Spider, Request
from games.items import GameItem, ReviewItem

class GameSpider(Spider):
    name = 'game_spider'
    allowed_urls = ['https://www.metacritic.com']
    start_urls = ['https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0']

    def parse(self, response):
        page_urls = ...  # scrape all result pages
        for url in page_urls:
            yield Request(url=url, callback=self.parse_game_urls, dont_filter=True)

    def parse_game_urls(self, response):
        game_urls = ...  # scrape each game url from each result page
        for url in game_urls:
            yield Request(url=url, callback=self.parse_game_page, dont_filter=True)

    def parse_game_page(self, response):
        # scrape game info
        item = GameItem()
        item['url'] = url
        item['title'] = title
        item['platform'] = platform
        item['genres'] = genres
        item['release_date'] = release_date
        item['ESRB_rating'] = ESRB_rating
        item['summary'] = summary
        item['average_user_score'] = average_user_score
        item['metascore'] = metascore
        item['developer'] = developer
        item['publisher'] = publisher
        yield item
        user_review_page = ...  # scrape url to review page
        yield Request(url=user_review_page, callback=self.parse_user_reviews, dont_filter=True)

    def parse_user_reviews(self, response):
        reviews = ...  # scrape all reviews
        for review in reviews:
            # scrape review info
            item = ReviewItem()
            item['title'] = title
            item['platform'] = platform
            item['username'] = username
            item['score'] = int(score)
            item['date'] = date
            item['review_text'] = review_text
            item['critic_flag'] = 0
            yield item
pipelines.py:
from scrapy.exporters import CsvItemExporter
from scrapy import signals
from pydispatch import dispatcher

class GamesPipeline(object):
    def __init__(self):
        self.fileNamesCsv = ['GameItem', 'ReviewItem']
        self.files = {}
        self.exporters = {}
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self, spider):
        self.files = dict([(name, open(name + '.csv', 'wb')) for name in self.fileNamesCsv])
        for name in self.fileNamesCsv:
            self.exporters[name] = CsvItemExporter(self.files[name])
            if name == 'GameItem':
                self.exporters[name].fields_to_export = ['url', 'title', 'platform', 'genres', 'release_date', 'ESRB_rating', 'summary',
                                                         'average_user_score', 'metascore', 'developer', 'publisher']
                self.exporters[name].start_exporting()
            if name == 'ReviewItem':
                self.exporters[name].fields_to_export = ['title', 'platform', 'username', 'score', 'date', 'review_text', 'critic_flag']
                self.exporters[name].start_exporting()

    def spider_closed(self, spider):
        [e.finish_exporting() for e in self.exporters.values()]
        [f.close() for f in self.files.values()]

    def process_item(self, item, spider):
        typesItem = type(item)
        if typesItem in set(self.fileNamesCsv):
            self.exporters[typesItem].export_item(item)
        return item
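The membership test in process_item looks like the likely culprit: type(item) returns the item's class object, while self.fileNamesCsv holds strings, so the class is never found in the set and export_item is never called. A minimal sketch of the mismatch, using a plain stand-in class rather than the real scrapy Item:

```python
class GameItem:
    """Stand-in for the scrapy GameItem class."""
    pass

file_names = ['GameItem', 'ReviewItem']
item = GameItem()

# A class object is never equal to a string, so this check always fails:
print(type(item) in set(file_names))           # False

# Comparing by the class *name* matches the string keys:
print(type(item).__name__ in set(file_names))  # True
```

Under this reading, dispatching on type(item).__name__ instead of type(item) would let the original single-pipeline design export items.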
In case it helps, here is what the terminal output looks like:
(base) C:\Users\bdbot\Desktop\games>scrapy crawl game_spider
2020-07-07 17:26:03 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: games)
2020-07-07 17:26:03 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 18.9.0, Python 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2020-07-07 17:26:03 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'games', 'DOWNLOAD_DELAY': 2, 'NEWSPIDER_MODULE': 'games.spiders', 'SPIDER_MODULES': ['games.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}
2020-07-07 17:26:03 [scrapy.extensions.telnet] INFO: Telnet Password: 51cb3c8116353545
2020-07-07 17:26:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-07-07 17:26:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-07 17:26:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-07 17:26:03 [scrapy.middleware] INFO: Enabled item pipelines:
['games.pipelines.GamesPipeline']
2020-07-07 17:26:03 [scrapy.core.engine] INFO: Spider opened
2020-07-07 17:26:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-07 17:26:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-07 17:26:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0> (referer: None)
2020-07-07 17:26:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0> (referer: https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0)
2020-07-07 17:26:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=129> (referer: https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0)
2020-07-07 17:26:18 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=126> (failed 1 times): 504 Gateway Time-out
2020-07-07 17:26:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=125> (failed 1 times): 504 Gateway Time-out
2020-07-07 17:26:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=128> (referer: https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0)
2020-07-07 17:26:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=127> (referer: https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0)
2020-07-07 17:26:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=124> (referer: https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0)
2020-07-07 17:26:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=123> (referer: https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0)
2020-07-07 17:26:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=122> (referer: https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0)
2020-07-07 17:26:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=121> (referer: https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0)
2020-07-07 17:26:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=117> (referer: https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0)
2020-07-07 17:26:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=120> (referer: https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0)
2020-07-07 17:26:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=119> (referer: https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0)
2020-07-07 17:26:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/game/xbox/burnout-3-takedown> (referer: https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=0)
2020-07-07 17:26:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.metacritic.com/game/xbox/burnout-3-takedown>
{'ESRB_rating': 'T',
'average_user_score': 7.6,
'developer': 'Criterion Games',
'genres': 'Driving, Racing, Arcade',
'metascore': 94.0,
'platform': 'Xbox',
'publisher': 'EA Games',
'release_date': 'Sep 7, 2004',
'summary': 'Burnout 3 challenges you to crash into (and through) busy '
'intersections, while creating as much damage as possible. You can '
'battle your way to the front of the pack by taking down rivals '
'and causing spectacular crashes. For those who thirst for '
'crashes, the game includes a crash mode that rewards you for '
'creating massive pileups. With multiplayer gameplay, more than '
'100 events, and 40 tracks, Burnout 3 provides intense speed and '
'action.',
'title': 'Burnout 3: Takedown',
'url': 'https://www.metacritic.com/game/xbox/burnout-3-takedown'}
Finished Scraping Burnout 3: Takedown
2020-07-07 17:26:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/game/playstation-4/assassins-creed-chronicles-india> (referer: https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=129)
2020-07-07 17:26:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.metacritic.com/game/playstation-4/assassins-creed-chronicles-india>
And so on, for every game item and every review item. They all print to the terminal window.
Try printing the absolute path of the newly created csv files, to double-check where they are being created. Here is some pseudocode:
# pipelines.py file
import os
...
    def spider_opened(self, spider):
        self.files = dict([(name, open(name + '.csv', 'wb')) for name in self.fileNamesCsv])
        for name in self.fileNamesCsv:
            print(os.path.realpath(self.files[name].name))  # new
            self.exporters[name] = CsvItemExporter(self.files[name])
...
Rewriting my pipelines.py as two separate classes solved my problem:
class GamesPipeline(object):
    def __init__(self):
        self.filename = 'games.csv'

    def open_spider(self, spider):
        self.csvfile = open(self.filename, 'wb')
        self.exporter = CsvItemExporter(self.csvfile)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.csvfile.close()

    def process_item(self, item, spider):
        if isinstance(item, GameItem):
            self.exporter.export_item(item)
        return item

class ReviewsPipeline(object):
    def __init__(self):
        self.filename = 'game_reviews.csv'

    def open_spider(self, spider):
        self.csvfile = open(self.filename, 'wb')
        self.exporter = CsvItemExporter(self.csvfile)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.csvfile.close()

    def process_item(self, item, spider):
        if isinstance(item, ReviewItem):
            self.exporter.export_item(item)
        return item
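One caveat: the startup log above lists only games.pipelines.GamesPipeline under "Enabled item pipelines", so for the two-class version both classes must be registered in settings.py. A sketch of the required entry (the priority numbers are arbitrary placeholders; lower runs first):

```python
# settings.py
ITEM_PIPELINES = {
    'games.pipelines.GamesPipeline': 300,
    'games.pipelines.ReviewsPipeline': 400,
}
```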