
Exporting scraped data to a CSV file

I'm trying to get data from a website that requires me to follow two URLs before scraping the data.

The goal is to get an exported file that looks like this:

[Image: clean data in a spreadsheet, with no blanks or gaps]
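For reference, the target layout (one complete row per item, no gaps) can be produced with Python's csv module. The column names info1 through info4 mirror the item class below; the row values here are made up:

```python
import csv
import io

# Hypothetical scraped rows; the field names mirror the myItems class below.
rows = [
    {"info1": "a1", "info2": "b1", "info3": "c1", "info4": "d1"},
    {"info1": "a2", "info2": "b2", "info3": "c2", "info4": "d2"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["info1", "info2", "info3", "info4"])
writer.writeheader()        # one header row
writer.writerows(rows)      # one complete row per item, no blanks or gaps

print(buf.getvalue())
```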

My code is as follows:

import scrapy
from scrapy.item import Item, Field
from scrapy import Request

class myItems(Item):
    info1 = Field()
    info2 = Field()
    info3 = Field()
    info4 = Field()

class mySpider(scrapy.Spider):
    name = 'techbot'
    start_urls = ['']

    def parse(self, response):
        #Extracts first link
        items = []

        list1 = response.css("").extract() #extract all info from here

        for i in list1:
            link1 = 'https:...' + str(i)
        request = Request(link1, self.parseInfo1, dont_filter=True)
            request.meta['item'] = items
            yield request

        yield items

    def parseInfo1(self, response):
        #Extracts second link
        item = myItems()
        items = response.meta['item']

        list1 = response.css("").extract()
        for i in list1:
            link1 = '' + str(i)
            request = Request(link1, self.parseInfo2, dont_filter=True)
            request.meta['item'] = items
            items.append(item)
            return request

    def parseInfo2(self, response):
        #Extracts all data
        item = myItems()
        items = response.meta['item']
        item['info1'] = response.css("").extract()
        item['info2'] = response.css("").extract()
        item['info3'] = response.css("").extract()
        item['info4'] = response.css("").extract()
        items.append(item)
        return items

I've executed the spider in the terminal with the command:

scrapy crawl techbot
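For the export step itself, Scrapy's built-in feed exports can write the yielded items straight to CSV with the -o option, so no custom export code is needed (output.csv is just an example file name):

```shell
scrapy crawl techbot -o output.csv
```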

The data I get is out of order, and with gaps like this:

[Image: out-of-order data]

For example, it scrapes the first set of data multiple times, and the rest is out of order.

If anyone could point me in the right direction to get the results in the cleaner format shown at the beginning, that would be greatly appreciated.

Thanks

Solved it by consolidating the following of both links into one function instead of two. My spider is now working as follows:

class mySpider(scrapy.Spider):
    name = 'techbot'
    start_urls = ['']

    def parse(self, response):
        #Extracts links
        items = []

        list1 = response.css("").extract()
        for i in list1:
            link1 = 'https:...' + str(i)
            request = Request(link1, self.parse, dont_filter=True)
            request.meta['item'] = items
            yield request

        list2 = response.css("").extract()
        for i in list2:
            link2 = '' + str(i)
            request = Request(link2, self.parseInfo2, dont_filter=True)
            request.meta['item'] = items
            yield request

        yield items

    def parseInfo2(self, response):
        #Extracts all data
        item = myItems()
        items = response.meta['item']
        item['info1'] = response.css("").extract()
        item['info2'] = response.css("").extract()
        item['info3'] = response.css("").extract()
        item['info4'] = response.css("").extract()
        items.append(item)
        return items
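One likely cause of the duplicated rows in the original spider is that every request shares the same mutable items list through response.meta, so each callback re-appends to it and re-emits it. The sketch below is independent of Scrapy and uses made-up helper names; it contrasts the shared-list pattern with yielding one fresh item per link:

```python
def scrape_shared(links):
    """Mimics the original pattern: one list shared across callbacks,
    re-emitted on every callback, so earlier rows repeat."""
    items = []
    results = []
    for link in links:
        items.append({"info1": link})
        results.append(list(items))  # each callback re-emits the whole list
    return results

def scrape_per_item(links):
    """Alternative: build and emit one fresh item per link."""
    return [{"info1": link} for link in links]

shared = scrape_shared(["a", "b", "c"])
flat = [row for batch in shared for row in batch]
print(len(flat))   # 6 rows: earlier items are duplicated

clean = scrape_per_item(["a", "b", "c"])
print(len(clean))  # 3 rows: exactly one per link
```

In Scrapy terms, the per-item version corresponds to yielding each item from the final callback and letting the feed exporter collect the rows, rather than passing one list through meta.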
