
Saving output to JSON format


I am trying to write my output, i.e. og = OpenGraph(i, ["og:title", "og:description", "og:image", "og:url"]), to a JSON file. But when I validate the output, it says it is not in the correct standard JSON format. Can anyone help me figure out what I am doing wrong?

# -*- coding: utf-8 -*-
import scrapy
from ..items import news18Item
import re
from webpreview import web_preview
from webpreview import OpenGraph
import json

class News18SSpider(scrapy.Spider):
    name = 'news18_story'
    page_number = 2
    start_urls = ['https://www.news18.com/movies/page-1/']

    def parse(self, response):
        items = news18Item()
        page_id = response.xpath('/html/body/div[2]/div[5]/div[2]/div[1]/div[*]/div[*]/p/a/@href').getall()
        items['page_id'] = page_id

        story_url = page_id

        for i in story_url:
            og = OpenGraph(i, ["og:title", "og:description", "og:image", "og:url"])

            dictionary =[{ "page_title": og.title }, { "description": og.description }, { "image_url": og.image }, { "post_url": og.url}] 

            with open("news18_new.json", "a") as outfile: 
                json.dump(dictionary, outfile)
                outfile.write("\n")
                # json.dump("\n",outfile) 



        next_page = 'https://www.news18.com/movies/page-' + str(News18SSpider.page_number) + '/'
        if News18SSpider.page_number <= 20:
           News18SSpider.page_number += 1  
           yield response.follow(next_page, callback = self.parse)

        pass

Here is minimal working code.

You can put all the code in a single file, script.py, and run it as python script.py without creating a project.

I yield every item as a single dictionary:

yield {
    "page_title": og.title,
    "description": og.description,
    "image_url": og.image,
    "post_url": og.url
}

and scrapy saves them as a correct JSON file containing a single list of dictionaries.

You created many separate lists - that is not correct JSON format.
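To see why, here is a small sketch (the file name is just for illustration) that reproduces the question's write pattern and runs the result through a strict JSON parser:

```python
import json

# Reproduce the question's pattern: one json.dump per item,
# each wrapped in its own list, written to the same file.
with open("broken.json", "w") as f:
    for item in [{"page_title": "a"}, {"page_title": "b"}]:
        json.dump([item], f)
        f.write("\n")

# A JSON document may contain only ONE top-level value,
# so a strict parser rejects the file with "Extra data".
try:
    with open("broken.json") as f:
        json.load(f)
except json.JSONDecodeError as e:
    print("invalid JSON:", e)
```

The file contains two independent JSON values, which is exactly what online validators reject.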

JSON is not a format you can append new data to. To add an item you have to read all the data into memory, append the new item to the data in memory, and write all the data back to the file.
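A minimal sketch of that read-append-write cycle (append_item and the file name are illustrative, not part of the original code):

```python
import json
import os

def append_item(path, item):
    """Add one item to a JSON array file by rewriting the whole file."""
    if os.path.exists(path):
        with open(path) as f:
            data = json.load(f)   # 1. read ALL existing data into memory
    else:
        data = []                 # first run: start with an empty list
    data.append(item)             # 2. append the new item in memory
    with open(path, "w") as f:
        json.dump(data, f)        # 3. write everything back out

append_item("items.json", {"page_title": "first"})
append_item("items.json", {"page_title": "second"})
```

Every call rewrites the entire file, which is why this approach gets slower as the file grows.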

You can append to a CSV file without reading all the data into memory.
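Sketched with the standard csv module (field names taken from the spider's dictionary; the file name and row values are illustrative):

```python
import csv
import os

fields = ["page_title", "description", "image_url", "post_url"]
row = {"page_title": "Title", "description": "Desc",
       "image_url": "http://example.com/img.jpg",
       "post_url": "http://example.com/post"}

new_file = not os.path.exists("items.csv")
with open("items.csv", "a", newline="") as f:   # "a" = append mode
    writer = csv.DictWriter(f, fieldnames=fields)
    if new_file:
        writer.writeheader()   # write the header only once, on first run
    writer.writerow(row)       # appends one line; nothing is re-read
```

Each new row is simply added to the end of the file, so the cost per item stays constant no matter how large the file gets.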


import scrapy
from webpreview import OpenGraph

class News18SSpider(scrapy.Spider):

    name = 'news18_story'
    page_number = 1
    start_urls = ['https://www.news18.com/movies/page-1/']

    def parse(self, response):
        #all_hrefs = response.xpath('/html/body/div[2]/div[5]/div[2]/div[1]/div[*]/div[*]/p/a/@href').getall()
        all_hrefs = response.xpath('//div[@class="blog-list-blog"]/p/a/@href').getall()

        for href in all_hrefs:
            og = OpenGraph(href, ["og:title", "og:description", "og:image", "og:url"])

            yield {
                "page_title": og.title,
                "description": og.description,
                "image_url": og.image,
                "post_url": og.url
            } 

        if self.page_number <= 20:
            self.page_number += 1  
            next_url = 'https://www.news18.com/movies/page-{}/'.format(self.page_number)
            #yield response.follow(next_url) # , callback=self.parse)
            yield scrapy.Request(next_url)

# --- run without project and save in `output.json` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',

    # save in file CSV, JSON or XML
    'FEED_FORMAT': 'json',     # csv, json, xml
    'FEED_URI': 'output.json', #
})

c.crawl(News18SSpider)
c.start() 


