
Exporting scraped data to a CSV file

I am trying to get data from a website that requires me to follow 2 URLs before scraping the data.

The goal is to get an exported file that looks like this:

[screenshot: clean data in a spreadsheet, with no blanks or gaps]

My code is as follows:

import scrapy
from scrapy.item import Item, Field
from scrapy import Request

class myItems(Item):
    info1 = Field()
    info2 = Field()
    info3 = Field()
    info4 = Field()

class mySpider(scrapy.Spider):
    name = 'techbot'
    start_urls = ['']

    def parse(self, response):
        #Extracts first link
        items = []

        list1 = response.css("").extract() #extract all info from here

        for i in list1:
            link1 = 'https:...' + str(i)
            request = Request(link1, self.parseInfo1, dont_filter =True)
            request.meta['item'] = items
            yield request

        yield items

    def parseInfo1(self, response):
        #Extracts second link
        item = myItems()
        items = response.meta['item']

        list1 = response.css("").extract()
        for i in list1:
            link1 = '' + str(i)
            request = Request(link1, self.parseInfo2, dont_filter =True)
            request.meta['item'] = items
            items.append(item)
            return request

    def parseInfo2(self, response):
        #Extracts all data
        item = myItems()
        items = response.meta['item']
        item['info1'] = response.css("").extract()
        item['info2'] = response.css("").extract()
        item['info3'] = response.css("").extract()
        item['info4'] = response.css("").extract()
        items.append(item)
        return items

I have run the spider in the terminal with the following command:

scrapy crawl techbot
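Since the goal is a CSV file, it may be worth noting that Scrapy's built-in feed exports can write every yielded item straight to CSV; assuming the spider yields items correctly, the crawl could be invoked with the `-o` flag instead (the filename `output.csv` is just an example):

```shell
# Write all items the spider yields to a CSV file via Scrapy's feed exports
scrapy crawl techbot -o output.csv
```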

The data I get is out of order and has gaps, like this:

[screenshot: out-of-order data]

For example, it scrapes the first set of data multiple times, and the rest is out of order.

If someone could point me in the right direction for getting the results in the cleaner format shown at the start, it would be greatly appreciated.

Thanks

Solved it by consolidating the following of both links into one function instead of two. My spider now works as follows:

class mySpider(scrapy.Spider):
    name = 'techbot'
    start_urls = ['']

    def parse(self, response):
        #Extracts links
        items = []

        list1 = response.css("").extract()
        for i in list1:
            link1 = 'https:...' + str(i)
            request = Request(link1, self.parse, dont_filter=True)
            request.meta['item'] = items
            yield request

        list2 = response.css("").extract()
        for i in list2:
            link2 = '' + str(i)
            request = Request(link2, self.parseInfo2, dont_filter=True)
            request.meta['item'] = items
            yield request

        yield items

    def parseInfo2(self, response):
        #Extracts all data
        item = myItems()
        items = response.meta['item']
        item['info1'] = response.css("").extract()
        item['info2'] = response.css("").extract()
        item['info3'] = response.css("").extract()
        item['info4'] = response.css("").extract()
        items.append(item)
        return items
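As an alternative to the command-line feed export, the collected items could also be written out with Python's standard `csv` module. This is a minimal sketch, assuming the items are dict-like with the four `info` fields used above (the sample values here are placeholders, not real scraped data):

```python
import csv

# Hypothetical scraped items, shaped like the myItems fields above
items = [
    {"info1": "a1", "info2": "b1", "info3": "c1", "info4": "d1"},
    {"info1": "a2", "info2": "b2", "info3": "c2", "info4": "d2"},
]

fieldnames = ["info1", "info2", "info3", "info4"]

# DictWriter emits one header row, then one row per item, with no gaps
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(items)
```

Writing the rows in one place, after all items have been collected, is what keeps the spreadsheet free of the blank rows and out-of-order entries described above.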
