Nested JSON items with scrapy

Here is my basic Scrapy crawler:

def parse(self, response):
    item = CruiseItem()

    item['Cruise'] = {}
    item['Cruise']['Cruiseline'] = response.xpath('//title/text()').extract()
    item['Cruise']['Itinerary'] = response.xpath('//*[@id="brochureName1"]/text()').extract()
    item['Cruise']['Price'] = response.xpath('//*[@id="interiorPrice1"]/text()').extract()
    item['Cruise']['PerNight'] = response.xpath('//*[@id="perNightinteriorPrice1"]/text()').extract()

    return item

This pulls in all the elements I want. My JSON feed, for example, comes out like this:

[

{
    "Cruise": {
        "Cruiseline": [
            "Ship Name"
        ],
        "Itinerary": [
            "3 Night Bahamas ",
            "4 Night Western Caribbean ",
            "4 Night Bahamas ",
            "3 Night Bahamas ",
            "5 Night Western Caribbean ",
            "5 Night Eastern Caribbean ",
            "7 Night Western Caribbean ",
            "7 Night Southern Caribbean ",
            "6 Night Western Caribbean ",
            "7 Night Western Caribbean ",
            "8 Night Eastern Caribbean "
        ],
        "Price": [
            "$169",
            "$179",
            "$289",
            "$349",
            "$359",
            "$389",
            "$389",
            "$409",
            "$424",
            "$524",
            "$939"
        ],
        "PerNight": [
            "$56/night",
            "$45/night",
            "$72/night",
            "$116/night",
            "$72/night",
            "$78/night",
            "$56/night",
            "$58/night",
            "$71/night",
            "$75/night",
            "$117/night"
        ]
    }
}
]

The JSON output I'm aiming for is different, however:

[

{
    "Cruise": {
        "Cruiseline": [
            "Ship Name"
        ],
        "Itinerary": [
            "3 Night Bahamas "
        ],
        "Price": [
            "$169"
        ],
        "PerNight": [
            "$56/night"

        ]
    },
    "Cruise": {
        "Cruiseline": [
            "Ship Name"
        ],
        "Itinerary": [
            "4 Night Bahamas "
        ],
        "Price": [
            "$79"
        ],
        "PerNight": [
            "$86/night"
        ]
    }
}
]

Essentially, I want each cruise entry to contain only one ship, one itinerary, one price, and one per-night rate.

Does this make sense? Would love to discuss.

EDIT: I asked this a few days ago, but decided to clarify and repost. Thanks!

Try reformatting the data with this script. The formatted data ends up in updated_list.

cruise_list = [

{
    "Cruise": {
        "Cruiseline": [
            "Ship Name"
        ],
        "Itinerary": [
            "3 Night Bahamas ",
            "4 Night Western Caribbean ",
            "4 Night Bahamas ",
            "3 Night Bahamas ",
            "5 Night Western Caribbean ",
            "5 Night Eastern Caribbean ",
            "7 Night Western Caribbean ",
            "7 Night Southern Caribbean ",
            "6 Night Western Caribbean ",
            "7 Night Western Caribbean ",
            "8 Night Eastern Caribbean "
        ],
        "Price": [
            "$169",
            "$179",
            "$289",
            "$349",
            "$359",
            "$389",
            "$389",
            "$409",
            "$424",
            "$524",
            "$939"
        ],
        "PerNight": [
            "$56/night",
            "$45/night",
            "$72/night",
            "$116/night",
            "$72/night",
            "$78/night",
            "$56/night",
            "$58/night",
            "$71/night",
            "$75/night",
            "$117/night"
        ]
    }
}
]

updated_list = []

for cruise_obj in cruise_list:
    cruise_data = cruise_obj['Cruise']
    # Itinerary, Price and PerNight are parallel lists: index i in each
    # one describes the same sailing, so build one entry per index.
    for i in range(len(cruise_data['Itinerary'])):
        sub_item = {}
        sub_item['Cruise'] = {}
        # The cruise line is shared by every entry.
        sub_item['Cruise']['Cruiseline'] = cruise_data['Cruiseline']
        sub_item['Cruise']['Itinerary'] = [cruise_data['Itinerary'][i]]
        sub_item['Cruise']['Price'] = [cruise_data['Price'][i]]
        sub_item['Cruise']['PerNight'] = [cruise_data['PerNight'][i]]
        updated_list.append(sub_item)

Some other thoughts:

  • If the only things stored in your JSON are cruise objects, then the top-level Cruise key is redundant.

  • You're often storing values in arrays that don't need to be arrays. I'm guessing this comes from Scrapy, but you should try modifying my script to remove the arrays around single values. For example, a cruise object shouldn't have multiple Cruiseline values. Let me know if you need help with that.
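
For what it's worth, here is a minimal sketch of both points applied to the same data: it drops the outer Cruise key and stores plain strings instead of one-element arrays. The function name split_cruises is made up for illustration, and the sketch assumes Cruiseline always holds exactly one name and that the three lists line up index by index.

```python
def split_cruises(cruise_list):
    """Turn column-style cruise data into one flat dict per sailing."""
    rows = []
    for cruise_obj in cruise_list:
        data = cruise_obj["Cruise"]
        # Assumption: "Cruiseline" holds a single name shared by every row.
        cruiseline = data["Cruiseline"][0]
        # zip() pairs the i-th itinerary, price and per-night rate;
        # it stops at the shortest of the three lists.
        for itinerary, price, per_night in zip(
            data["Itinerary"], data["Price"], data["PerNight"]
        ):
            rows.append({
                "Cruiseline": cruiseline,
                "Itinerary": itinerary.strip(),  # drop the trailing space from the scrape
                "Price": price,
                "PerNight": per_night,
            })
    return rows


# Shortened sample in the same shape as cruise_list.
sample = [{"Cruise": {
    "Cruiseline": ["Ship Name"],
    "Itinerary": ["3 Night Bahamas ", "4 Night Western Caribbean "],
    "Price": ["$169", "$179"],
    "PerNight": ["$56/night", "$45/night"],
}}]

flat = split_cruises(sample)
```

Each element of flat is then a single self-contained record, which tends to be easier to consume than nested one-element arrays.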

Figured it out.

def parse(self, response):

    final_list = []

    item = WthItem()

    # Each XPath returns a list with one value per cruise on the page.
    item['ship'] = response.xpath('//*[@id="shipName1"]/text()').extract()
    item['Itinerary'] = response.xpath('//*[@id="brochureName1"]/text()').extract()
    item['Price'] = response.xpath('//*[@id="interiorPrice1"]/text()').extract()
    item['PerNight'] = response.xpath('//*[@id="perNightinteriorPrice1"]/text()').extract()

    final_list.append(item)

    updated_list = []

    for item in final_list:
        # Split the parallel lists into one entry per cruise.
        for i in range(len(item['ship'])):
            sub_item = {}
            sub_item['entry'] = {}
            sub_item['entry']['ship'] = [item['ship'][i]]
            sub_item['entry']['Itinerary'] = [item['Itinerary'][i]]
            sub_item['entry']['Price'] = [item['Price'][i]]
            sub_item['entry']['PerNight'] = [item['PerNight'][i]]
            updated_list.append(sub_item)

            print(sub_item)

    # Return only after every item has been processed.
    return updated_list
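
One caveat with the parse method above: it indexes every list by the length of item['ship'], so a page where one of the other XPaths matches fewer nodes would raise an IndexError. A possible variation, sketched here as a plain generator so it can be tried without a spider, pads missing values with None instead. The helper name iter_entries is hypothetical, and it stores plain values rather than one-element lists.

```python
from itertools import zip_longest


def iter_entries(ships, itineraries, prices, per_night):
    """Yield one entry per row, padding short columns with None."""
    for ship, itinerary, price, rate in zip_longest(
        ships, itineraries, prices, per_night
    ):
        yield {
            "entry": {
                "ship": ship,
                "Itinerary": itinerary,
                "Price": price,
                "PerNight": rate,
            }
        }


# Sample data where the price column is one value short.
entries = list(iter_entries(
    ["Ship A", "Ship B"],
    ["3 Night Bahamas", "4 Night Bahamas"],
    ["$169"],
    ["$56/night", "$45/night"],
))
```

Inside a spider callback on Python 3, the four extracted lists could be passed in and the results handed back with "yield from iter_entries(...)", since Scrapy callbacks may be generators that yield dicts.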
