Saving Web Crawling Results (Scrapy)
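I wrote a spider that seems to run correctly, but I'm not sure how to save the data it is collecting.
The spider starts at TheScienceForum, grabs the main forum pages, and creates an item for each one. It then goes through all of the individual forum pages (passing the items along with it), adding the title of each thread to the matching forum item. The code is as follows: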
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from individualProject.items import ProjectItem

class TheScienceForum(BaseSpider):
    name = "TheScienceForum.com"
    allowed_domains = ["www.thescienceforum.com"]
    start_urls = ["http://www.thescienceforum.com"]

    def parse(self, response):
        Sel = HtmlXPathSelector(response)
        forumNames = Sel.select('//h2[@class="forumtitle"]/a/text()').extract()
        items = []
        for forumName in forumNames:
            item = ProjectItem()
            item['name'] = forumName
            items.append(item)

        forums = Sel.select('//h2[@class="forumtitle"]/a/@href').extract()
        itemDict = {}
        itemDict['items'] = items
        for forum in forums:
            yield Request(url=forum,meta=itemDict,callback=self.addThreadNames)

    def addThreadNames(self, response):
        items = response.meta['items']
        Sel = HtmlXPathSelector(response)
        currentForum = Sel.select('//h1/span[@class="forumtitle"]/text()').extract()
        for item in items:
            if currentForum==item['name']:
                item['thread'] += Sel.select('//h3[@class="threadtitle"]/a/text()').extract()
        self.log(items)

        itemDict = {}
        itemDict['items'] = items
        threadPageNavs = Sel.select('//span[@class="prev_next"]/a[@rel="next"]/@href').extract()
        for threadPageNav in threadPageNavs:
            yield Request(url=threadPageNav,meta=itemDict,callback=self.addThreadNames)
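It seems that because I never simply return the objects (I only yield new requests), the data never persists anywhere. I tried using the following JSON pipeline: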
class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
I also ran the spider with the following command:
scrapy crawl TheScienceForum.com -o items.json -t json
But so far nothing has worked. Where might I be going wrong?
Any ideas or constructive criticism are warmly welcomed.
You need to yield item in at least one of your callbacks.
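To illustrate why yielding only requests leaves nothing to save, here is a minimal standard-library sketch of the pattern. It does not use Scrapy: FakeRequest, PAGES, parse_forum and crawl are all hypothetical names invented for this illustration, and the toy crawl loop only approximates how an engine tells requests apart from scraped items. The item travels along in meta and is yielded once there is no next page.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request object (not scrapy.http.Request)."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


# Hypothetical site: URL -> (thread titles on that page, URL of the next page or None)
PAGES = {
    "forum/physics?page=1": (["Dark matter", "Entropy"], "forum/physics?page=2"),
    "forum/physics?page=2": (["Relativity"], None),
}


def parse_forum(url, meta):
    """Callback: add this page's thread titles to the item carried in meta."""
    item = meta["item"]
    titles, next_url = PAGES[url]
    item["threads"].extend(titles)
    if next_url is not None:
        # More pages to visit: pass the partially filled item along.
        yield FakeRequest(next_url, parse_forum, {"item": item})
    else:
        # Last page: yielding the item is the only point where data leaves the callback.
        yield item


def crawl(start):
    """Toy engine: follow yielded requests, collect everything else as scraped output."""
    pending = [start]
    scraped = []
    while pending:
        request = pending.pop()
        for result in request.callback(request.url, request.meta):
            if isinstance(result, FakeRequest):
                pending.append(result)
            else:
                scraped.append(result)
    return scraped


start = FakeRequest(
    "forum/physics?page=1",
    parse_forum,
    {"item": {"name": "Physics", "threads": []}},
)
items = crawl(start)
print(items)
# [{'name': 'Physics', 'threads': ['Dark matter', 'Entropy', 'Relativity']}]
```

In the spider from the question, the corresponding change would probably be to yield the matching item from addThreadNames once threadPageNavs comes back empty. Note also that extract() returns a list, so currentForum==item['name'] compares a list with a string and is unlikely to ever match as written.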
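@bornytm - self.addThreadNames is the function to which you are passing the URLs or intermediate results. If you want to save the results to a CSV or JSON file, you can do the following: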
yield result (replace result with the variable in which you are storing the data; if you have multiple values, yield each one inside a for loop)
Then run:
scrapy crawl TheScienceForum.com -o output.csv -t csv
This should help.
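If a pipeline is preferred over the -o feed export, the JsonWriterPipeline from the question could be tightened roughly as follows. This is a sketch under assumptions, not a drop-in fix: the path argument is added here for illustration (the question hard-codes 'items.jl'), the file is opened in text mode so that writing a str also works on Python 3, and the open_spider/close_spider hooks are used so the file is closed when the crawl ends. A pipeline only ever receives items that a callback yields, so it does not replace the yield described above, and it has to be enabled through the ITEM_PIPELINES setting of the project.

```python
import json


class JsonLinesWriterPipeline:
    """Write every item received from the spider as one JSON object per line."""

    def __init__(self, path="items.jl"):
        # `path` is an assumption for this sketch; the default matches the question.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Text mode with an explicit encoding, so str lines can be written directly.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Closing flushes buffered lines to disk.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for dict-like item objects.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        # Return the item so any later pipeline stage still receives it.
        return item
```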