Saving the output to JSON format
I am trying to write my output, i.e. og = OpenGraph(i, ["og:title", "og:description", "og:image", "og:url"]), to a JSON file. But when I run the output through a validator, it says it is not in correct standard JSON format. Can anyone help me figure out what I am doing wrong?
# -*- coding: utf-8 -*-
import scrapy
from ..items import news18Item
import re
from webpreview import web_preview
from webpreview import OpenGraph
import json

class News18SSpider(scrapy.Spider):
    name = 'news18_story'
    page_number = 2
    start_urls = ['https://www.news18.com/movies/page-1/']

    def parse(self, response):
        items = news18Item()
        page_id = response.xpath('/html/body/div[2]/div[5]/div[2]/div[1]/div[*]/div[*]/p/a/@href').getall()
        items['page_id'] = page_id
        story_url = page_id

        for i in story_url:
            og = OpenGraph(i, ["og:title", "og:description", "og:image", "og:url"])
            dictionary = [{"page_title": og.title}, {"description": og.description}, {"image_url": og.image}, {"post_url": og.url}]
            with open("news18_new.json", "a") as outfile:
                json.dump(dictionary, outfile)
                outfile.write("\n")
                # json.dump("\n", outfile)

        next_page = 'https://www.news18.com/movies/page-' + str(News18SSpider.page_number) + '/'
        if News18SSpider.page_number <= 20:
            News18SSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)
        pass
Here is minimal working code. You can put all of the code in a single file, script.py, and run it as python script.py without creating a project.
I yield each item as a single dictionary

yield {
    "page_title": og.title,
    "description": og.description,
    "image_url": og.image,
    "post_url": og.url
}

and scrapy saves it as a correct JSON file: a single list containing many dictionaries.
You created many separate lists, one per line - that is not correct JSON format.
A JSON file is not a format you can append new data to. You have to read all the data into memory, append the new item to the data in memory, and write all the data back to the file again.
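To make the read-modify-write cycle concrete, here is a minimal sketch of what "appending" to a JSON array file actually requires. The function name append_record and the file name demo.json are illustrative, not from the original code:

```python
import json
import os

def append_record(path, record):
    """'Append' to a JSON list file: load everything, append in memory, rewrite."""
    if os.path.exists(path):
        with open(path) as f:
            data = json.load(f)  # the entire file must fit in memory
    else:
        data = []
    data.append(record)
    with open(path, "w") as f:
        json.dump(data, f)  # the whole file is rewritten every time

append_record("demo.json", {"page_title": "first"})
append_record("demo.json", {"page_title": "second"})
with open("demo.json") as f:
    print(json.load(f))  # [{'page_title': 'first'}, {'page_title': 'second'}]
```

Note how each call rewrites the whole file, which is why letting scrapy collect all yielded items and write the file once at the end is the better approach.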
You can append to a CSV file without reading all the data into memory.
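For comparison, a hedged sketch of CSV appending with the standard library's csv module; the helper name append_csv and file name demo.csv are illustrative. Each call opens the file in append mode and writes one row, so nothing already on disk needs to be read back:

```python
import csv
import os

FIELDS = ["page_title", "description", "image_url", "post_url"]

def append_csv(path, record):
    """Append one row to a CSV file; existing rows are never read."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # write the header only once
        writer.writerow(record)

append_csv("demo.csv", {"page_title": "A", "description": "d1",
                        "image_url": "i1", "post_url": "u1"})
append_csv("demo.csv", {"page_title": "B", "description": "d2",
                        "image_url": "i2", "post_url": "u2"})
```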
import scrapy
from webpreview import OpenGraph

class News18SSpider(scrapy.Spider):
    name = 'news18_story'
    page_number = 1
    start_urls = ['https://www.news18.com/movies/page-1/']

    def parse(self, response):
        #all_hrefs = response.xpath('/html/body/div[2]/div[5]/div[2]/div[1]/div[*]/div[*]/p/a/@href').getall()
        all_hrefs = response.xpath('//div[@class="blog-list-blog"]/p/a/@href').getall()

        for href in all_hrefs:
            og = OpenGraph(href, ["og:title", "og:description", "og:image", "og:url"])
            yield {
                "page_title": og.title,
                "description": og.description,
                "image_url": og.image,
                "post_url": og.url
            }

        if self.page_number <= 20:
            self.page_number += 1
            next_url = 'https://www.news18.com/movies/page-{}/'.format(self.page_number)
            #yield response.follow(next_url)  # , callback=self.parse)
            yield scrapy.Request(next_url)

# --- run without project and save in `output.json` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    # save in file CSV, JSON or XML
    'FEED_FORMAT': 'json',  # csv, json, xml
    'FEED_URI': 'output.json',
})
c.crawl(News18SSpider)
c.start()
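One caveat: in Scrapy 2.1 and later, the FEED_FORMAT and FEED_URI settings are deprecated in favour of the single FEEDS setting. A hedged equivalent of the export configuration above (the dictionary name custom_settings mirrors the spider attribute of the same name, but any settings dict passed to CrawlerProcess works the same way):

```python
# Scrapy 2.1+ replaces FEED_FORMAT / FEED_URI with the FEEDS setting:
# a mapping of output URI -> options for that feed.
custom_settings = {
    'FEEDS': {
        'output.json': {'format': 'json'},  # also supports csv, xml, jsonlines
    },
}
```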