
How to use JSON containing URL and ID in Scrapy and structure the results?

I'm using Scrapy to scrape data from 9000+ URLs contained in a JSON file, each with a matching ID. Here is the shape of my JSON objects:

[{
    "objectID": 10500,
    "gm_url": "https://reddit.com/1"
},
{
    "objectID": 10501,
    "gm_url": "https://reddit.com/2"
}]

I'd like my results in a JSON file containing the scraped data together with the matching URL and ID:

[{
    "objectID": 10500,
    "gm_url": "https://reddit.com",
    "results": [
        {
            "model": "",
            "price": "",
            "auction": "",
            "date": "",
            "auction_url": "",
            "img": ""
        },
        {
            "model": "",
            "price": "",
            "auction": "",
            "date": "",
            "auction_url": "",
            "img": ""
        },
        {
            "model": "",
            "price": "",
            "auction": "",
            "date": "",
            "auction_url": "",
            "img": ""
        }
    ]
}]
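For reference, that nesting can be produced in plain Python by grouping flat scraped rows under their source object. A minimal standalone sketch (the rows below are made-up placeholders for what the spider would yield, each tagged with the URL it came from):

```python
import json

# Source objects, keyed by URL so scraped rows can be matched back to their objectID
objects = [
    {"objectID": 10500, "gm_url": "https://reddit.com/1"},
    {"objectID": 10501, "gm_url": "https://reddit.com/2"},
]
by_url = {o["gm_url"]: dict(o, results=[]) for o in objects}

# Flat rows as a spider might yield them (one dict per car, tagged with its URL)
rows = [
    {"url": "https://reddit.com/1", "model": "A", "price": "1"},
    {"url": "https://reddit.com/1", "model": "B", "price": "2"},
    {"url": "https://reddit.com/2", "model": "C", "price": "3"},
]

# Move each row under its source object, dropping the now-redundant url key
for row in rows:
    url = row.pop("url")
    by_url[url]["results"].append(row)

nested = list(by_url.values())
print(json.dumps(nested, indent=2))
```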

Here is my current Scrapy code (which is kind of messy):

import json
import scrapy

with open('/home/bolgi/Workspace/Dev/python_workspace/gm_spider/Json/db_urls_glenmarch_results_scrapy_reduced.json', encoding='utf-8') as data_file:
    data = json.load(data_file)

for item in data:
    objectId = item['objectID']
    gmUrl = item['gm_url']

    class GlenMarchSpider(scrapy.Spider):
        name = 'glenmarch'

        def start_requests(self):
            start_urls = gmUrl

            for url in start_urls:
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            for caritem in response.css("div.car-item-border"):
                yield {
                    "url": response.url,
                    "model": caritem.css("div.make::text").get(),
                    "price": caritem.css("div.price::text").get(),
                    "auction": caritem.css("div.auctionHouse::text").get(),
                    "date": caritem.css("div.date::text").get(),
                    "auction_url": caritem.css("div.view-auction a::attr(href)").get(),
                    "img": caritem.css("img.img-responsive::attr(src)").get()
                }

I don't know how to structure the code or how to use the JSON file; I'm new to Python and it's a bit difficult for me.

You should never declare a class inside a for loop.

I suggest the following structure:

import json
import scrapy

class GlenMarchSpider(scrapy.Spider):
    name = 'glenmarch'

    def __init__(self):
        with open('/home/bolgi/Workspace/Dev/python_workspace/gm_spider/Json/db_urls_glenmarch_results_scrapy_reduced.json', encoding='utf-8') as data_file:
            self.data = json.load(data_file)

    def start_requests(self):
        for item in self.data:
            request = scrapy.Request(item['gm_url'], callback=self.parse)
            request.meta['item'] = item
            yield request

    def parse(self, response):
        item = response.meta['item']
        item['results'] = []
        for caritem in response.css("div.car-item-border"):
            item['results'].append({
                "model": caritem.css("div.make::text").get(),
                "price": caritem.css("div.price::text").get(),
                "auction": caritem.css("div.auctionHouse::text").get(),
                "date": caritem.css("div.date::text").get(),
                "auction_url": caritem.css("div.view-auction a::attr(href)").get(),
                "img": caritem.css("img.img-responsive::attr(src)").get()
            })
        yield item

Then you can call your spider (and save the output to a new JSON file):

$ scrapy crawl glenmarch -o myjson.json -t json
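The json feed exporter writes the items as a single JSON array, so the output can be read back with one json.load call. A tiny sketch, using an in-memory string in place of myjson.json (note that re-running with -o appends to an existing file, which can leave it as invalid JSON, so delete it between runs):

```python
import io
import json

# Stand-in for open("myjson.json", encoding="utf-8") once the crawl has finished
fake_file = io.StringIO(
    '[{"objectID": 10500, "gm_url": "https://reddit.com/1", "results": []}]'
)
items = json.load(fake_file)
print(items[0]["objectID"])
```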

If there is anything in the code you don't understand, don't hesitate to ask for clarification! :)

The Scrapy Spider also has a start_urls list attribute, empty by default, to which you can append all the URLs.

import scrapy
import json

class GlenMarchSpider(scrapy.Spider):
    name = 'glenmarch'
    start_urls = []

    with open('/home/bolgi/Workspace/Dev/python_workspace/gm_spider/Json/db_urls_glenmarch_results_scrapy_reduced.json', encoding='utf-8') as json_file:
        data = json.load(json_file)
        for item in data:
            start_urls.append(item['gm_url'])

    def parse(self, response):
        for caritem in response.css("div.car-item-border"):
            yield {
                "url": response.url,
                "model": caritem.css("div.make::text").get(),
                "price": caritem.css("div.price::text").get(),
                "auction": caritem.css("div.auctionHouse::text").get(),
                "date": caritem.css("div.date::text").get(),
                "auction_url": caritem.css("div.view-auction a::attr(href)").get(),
                "img": caritem.css("img.img-responsive::attr(src)").get()
            }
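One caveat with this version: it yields flat items tagged only with the response URL, so the objectID is dropped. Since the JSON data is already loaded at class level, the parse method could look the ID up from a URL-to-ID mapping. A minimal standalone sketch of that lookup (the names and sample data are illustrative, not part of the original answer):

```python
# Build a URL -> objectID lookup once from the loaded JSON data
objects = [
    {"objectID": 10500, "gm_url": "https://reddit.com/1"},
    {"objectID": 10501, "gm_url": "https://reddit.com/2"},
]
id_by_url = {o["gm_url"]: o["objectID"] for o in objects}

def tag_row(row):
    """Return a copy of a scraped row with the matching objectID attached."""
    return dict(row, objectID=id_by_url.get(row["url"]))

tagged = tag_row({"url": "https://reddit.com/2", "model": "C"})
print(tagged)
```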

And you can run the spider this way too:

scrapy runspider quotes_spider.py -o glenmarch.json

For more details, please check out the official documentation or feel free to ask: https://scrapy.readthedocs.io/en/latest/intro/overview.html
