
Scrapy spider not saving to csv

I have a spider that reads a list of URLs from a text file and saves the title and body text from each page. The crawl works, but the data does not get saved to CSV. I set up a pipeline to save to CSV because the normal -o option did not work for me. I did change settings.py to enable the pipeline. Any help with this would be greatly appreciated. The code is as follows:

Items.py

from scrapy.item import Item, Field

class PrivacyItem(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()
    desc = Field()

PrivacySpider.py

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from privacy.items import PrivacyItem

class PrivacySpider(CrawlSpider):
    name = "privacy"
    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for url in start_urls:
            item = PrivacyItem()
            item['desc'] = hxs.select('//body//p/text()').extract()
            item['title'] = hxs.select('//title/text()').extract()
            items.append(item)

        return items

Pipelines.py

import csv

class CSVWriterPipeline(object):

    def __init__(self):
        self.csvwriter = csv.writer(open('CONTENT.csv', 'wb'))

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['title'][0], item['desc'][0]])
        return item

You don't have to loop over start_urls; Scrapy does something like this:

for url in spider.start_urls:
    request url and call spider.parse() with its response

So your parse function should look something like:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    item = PrivacyItem()
    item['desc'] = hxs.select('//body//p/text()').extract()
    item['title'] = hxs.select('//title/text()').extract()      
    return item
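
To make the "one parse() call per response" point concrete, here is a toy simulation in plain Python. It is not Scrapy's real scheduling code: FakeResponse, crawl and the example.com URLs are hypothetical stand-ins, used only to show that parse() needs to build a single item because the framework already iterates over the start URLs.

```python
# Toy stand-in for the framework's loop; not Scrapy's actual implementation.
class FakeResponse:
    """Hypothetical minimal response: just a URL and a page title."""

    def __init__(self, url, title):
        self.url = url
        self.title = title


def parse(response):
    # Called once per response, so it builds exactly one item.
    return {"url": response.url, "title": response.title}


def crawl(pages):
    # Plays the role of the framework: one parse() call per start URL.
    return [parse(FakeResponse(url, title)) for url, title in pages.items()]


# Hypothetical pages standing in for the contents of urls.txt.
pages = {
    "http://example.com/a": "Page A",
    "http://example.com/b": "Page B",
}
items = crawl(pages)
print(len(items))  # one item per URL
```

If parse() looped over all the start URLs itself, every response would yield one copy of the same item per URL instead of a single item.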

Also, try to avoid returning lists as item fields; take the first match instead, for example: hxs.select('..').extract()[0]
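
One caveat with indexing [0] directly is that it raises IndexError when the XPath matches nothing. Below is a minimal sketch of a guard for that, together with a variant of the question's pipeline that closes the file when the spider finishes so buffered rows reach disk. It assumes Python 3 (the original 'wb' mode is Python 2 style). first_or_empty and the path parameter are hypothetical additions, while open_spider and close_spider are the hooks Scrapy calls on item pipelines. Whether the unclosed file is what caused the missing output here is not confirmed by the question.

```python
import csv


def first_or_empty(values):
    """Return the first extracted value, stripped, or '' if nothing matched."""
    return values[0].strip() if values else ""


class CSVWriterPipeline:
    """Sketch of the question's pipeline with explicit open/close of the file."""

    def __init__(self, path="CONTENT.csv"):
        # Hypothetical parameter; the original hard-codes the file name.
        self.path = path

    def open_spider(self, spider):
        # newline="" is what the csv module expects for files in Python 3.
        self.file = open(self.path, "w", newline="", encoding="utf-8")
        self.csvwriter = csv.writer(self.file)

    def close_spider(self, spider):
        # Closing the file flushes any buffered rows to disk.
        self.file.close()

    def process_item(self, item, spider):
        # Works whether the fields hold lists of matches or are empty.
        self.csvwriter.writerow(
            [first_or_empty(item["title"]), first_or_empty(item["desc"])]
        )
        return item
```

The pipeline still has to be enabled through ITEM_PIPELINES in settings.py, as the question already does.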
