[英]Scrapy spider not saving to csv
我有一個蜘蛛,它從文本文件中讀取網址列表,並從每個文件中保存標題和正文。 抓取有效,但數據未保存到csv。 我設置了一個保存到csv的管道,因為普通的-o選項對我不起作用。 我確實更改了piepline的settings.py。 任何幫助,將不勝感激。 代碼如下:
Items.py
from scrapy.item import Item, Field
class PrivacyItem(Item):
# define the fields for your item here like:
# name = Field()
title = Field()
desc = Field()
PrivacySpider.py
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from privacy.items import PrivacyItem
class PrivacySpider(CrawlSpider):
name = "privacy"
f = open("urls.txt")
start_urls = [url.strip() for url in f.readlines()]
f.close()
def parse(self, response):
hxs = HtmlXPathSelector(response)
items =[]
for url in start_urls:
item = PrivacyItem()
item['desc'] = hxs.select('//body//p/text()').extract()
item['title'] = hxs.select('//title/text()').extract()
items.append(item)
return items
Pipelines.py
import csv
class CSVWriterPipeline(object):
def __init__(self):
self.csvwriter = csv.writer(open('CONTENT.csv', 'wb'))
def process_item(self, item, spider):
self.csvwriter.writerow([item['title'][0], item['desc'][0]])
return item
您不必在start_urls
上循環,scrapy正在執行以下操作:
for url in spider.start_urls:
request url and call spider.parse() with its response
因此您的解析函數應類似於:
def parse(self, response):
hxs = HtmlXPathSelector(response)
item = PrivacyItem()
item['desc'] = hxs.select('//body//p/text()').extract()
item['title'] = hxs.select('//title/text()').extract()
return item
還應嘗試避免將列表作為項目字段返回,請執行以下操作: hxs.select('..').extract()[0]
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.