I've written a spider that seems to be functioning properly, but I'm not sure how to save the data it is collecting.
The spider starts out at TheScienceForum, grabs the main forum pages, and makes an item for each. It then proceeds through all of the individual forum pages (passing the items along with it), adding the title of each thread to the matching forum item. The code is as follows:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from individualProject.items import ProjectItem

class TheScienceForum(BaseSpider):
    name = "TheScienceForum.com"
    allowed_domains = ["www.thescienceforum.com"]
    start_urls = ["http://www.thescienceforum.com"]

    def parse(self, response):
        Sel = HtmlXPathSelector(response)
        forumNames = Sel.select('//h2[@class="forumtitle"]/a/text()').extract()
        items = []
        for forumName in forumNames:
            item = ProjectItem()
            item['name'] = forumName
            items.append(item)
        forums = Sel.select('//h2[@class="forumtitle"]/a/@href').extract()
        itemDict = {}
        itemDict['items'] = items
        for forum in forums:
            yield Request(url=forum, meta=itemDict, callback=self.addThreadNames)

    def addThreadNames(self, response):
        items = response.meta['items']
        Sel = HtmlXPathSelector(response)
        currentForum = Sel.select('//h1/span[@class="forumtitle"]/text()').extract()
        for item in items:
            if currentForum == item['name']:
                item['thread'] += Sel.select('//h3[@class="threadtitle"]/a/text()').extract()
        self.log(items)
        itemDict = {}
        itemDict['items'] = items
        threadPageNavs = Sel.select('//span[@class="prev_next"]/a[@rel="next"]/@href').extract()
        for threadPageNav in threadPageNavs:
            yield Request(url=threadPageNav, meta=itemDict, callback=self.addThreadNames)
It seems that because I never return or yield the items themselves (only new requests are yielded), the data never persists anywhere. I've tried using the following JSON pipeline:
import json

class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
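As a side note, a custom pipeline only runs if it is registered in the project's settings.py. The module path below is an assumption based on the project name individualProject; adjust it to wherever the class actually lives:

```python
# settings.py -- register the pipeline (path is an assumption based on the
# "individualProject" project name; the number is the run-order priority)
ITEM_PIPELINES = {
    'individualProject.pipelines.JsonWriterPipeline': 300,
}
```

Older Scrapy versions (before 0.24) expect a plain list here instead of a dict.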
I have also tried running the spider with the following:
scrapy crawl TheScienceForum.com -o items.json -t json
but so far nothing is working. Where might I be going wrong?
Any ideas or constructive criticism are warmly welcomed.
You need to yield item in at least one of your callbacks.
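To illustrate the point, here is a minimal sketch of a callback that yields both scraped items and follow-up work, modeled with plain dicts and URL strings so it runs without a crawl (the names are illustrative, not Scrapy API). A callback that yields only follow-up requests never emits anything for the exporter or pipelines to save.

```python
# Illustrative sketch (not Scrapy API): a callback must yield the scraped
# items as well as any follow-up requests, or nothing reaches the exporter.
def add_thread_names(thread_titles, next_page_urls):
    # yield one item per scraped thread title -- these get exported
    for title in thread_titles:
        yield {"thread": title}
    # yield follow-up "requests" (just URLs here) -- these get crawled
    for url in next_page_urls:
        yield {"follow": url}

out = list(add_thread_names(["Why is the sky blue?"], ["page2"]))
```

Items and requests are interleaved in the same generator; Scrapy distinguishes them by type, exporting items and scheduling requests.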
@bornytm -

self.addThreadNames

is the callback to which you are passing the URLs and intermediate results. If you want to save the data to a CSV or JSON file, you can do it as follows:

yield item

(Replace item with the variable holding your scraped data. If you have multiple values, yield inside a for loop.)

After that:

scrapy crawl TheScienceForum.com -o output.csv -t csv

This should help you.
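For reference, the serialization step that the JsonWriterPipeline above performs for each yielded item can be checked with the standard library alone; the helper name below is mine, not Scrapy's:

```python
import json

# JSON Lines serialization, mirroring what process_item does for each
# item: one JSON object per line.
def to_json_lines(items):
    return "".join(json.dumps(dict(item)) + "\n" for item in items)

sample = to_json_lines([{"name": "Physics", "thread": ["Thread A"]}])
```

Each line of the resulting file is an independent JSON object, which is what the .jl extension conventionally denotes.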