
Issue with extracting data from a JavaScript structure on a website using Beautiful Soup in Python

I am trying to scrape data from a website that uses a JavaScript structure to load its data. To do that, I used the solution to the question "Issue with html tags while scraping data using beautiful soup". After getting the JSON data dictionary, I iterated over it and successfully got the device name and price data.

The code in that solution extracts data from the window that shows the device name and price; in the code, the corresponding attribute is window.rates.

Problem: The website is structured in 3 parts:

  1. The 1st part contains a window with the plan name and its other details
  2. The 2nd part contains a window with the device name and price (this is the window I am currently scraping data from)
  3. The 3rd part contains the plan name, device name, price and monthly prices

I want to extract data from the 3rd part because I need all 4 fields (plan name, device name, price, monthly price). I can already scrape data from the 1st and 2nd parts using the solution to the question mentioned above.

However, I cannot find the JavaScript that loads the data in the 3rd part, nor the attribute (e.g. window.rates for the 2nd part) that I would need to get the JSON dictionary of data for the 3rd part.

Also, the data in the 3rd part of the website changes as we scroll through the windows in the 2nd part.

PS: I tried printing all the scripts running on the page to find the one that loads the data in the 3rd part, but that did not help.

Please help me in solving this issue.

You provided a link to your previous question that mentions the site you're interested in:

http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html

You just have to look at the code.

Say you select "Red M" as the plan and "Samsung Galaxy SIII Blau (Blue) / 16 GB" as the device. The bottom section will display:

Detail Items

Einmalige Kosten (One-time costs)

  1. Anschlusspreis (Activation Charge): 29.99
  2. Einmalzahlung (Onetime Payment) Smartphone: 9.90

Monatliche Kosten (Monthly Charges)

  3. Red M: 59.99
  4. 24 x 10 % Rabatt (discount): -6.00
  5. 24 x 5 Euro Smartphone-Rabatt: -5.00
  6. Also, one of three 10.00/month discounts is available for being a student, young, or handicapped.

You need to parse (maybe using Python's JSON module) these JavaScript assignments:

window.phones
window.rates
window.discounts
window.goodies
window.promotions
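Each of these assignments can be pulled out of the page source as text before any parsing happens. The following is a minimal sketch, not the site's own mechanism: the helper name `extract_js_object` is hypothetical, and it assumes every assignment has the form `window.<name> = { ... }` with double-quoted strings and no comments containing unbalanced braces or quotes.

```python
import re


def extract_js_object(page_source, name):
    """Return the text of the object literal assigned to window.<name>, or None.

    Hypothetical helper: scans forward from the opening brace and counts
    braces, ignoring any that appear inside double-quoted strings.
    """
    match = re.search(r"window\.%s\s*=\s*\{" % re.escape(name), page_source)
    if match is None:
        return None
    start = match.end() - 1  # index of the opening brace
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(page_source)):
        ch = page_source[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return page_source[start:i + 1]
    return None  # braces never balanced
```

Here `page_source` can be the text of a single script tag found with Beautiful Soup, or the whole HTML document as a string.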

I'll walk you through the data structures. You'll have to write the code yourself.

window.phones contains this entry (keeping with our example):

window.phones = {
    sku1224225:{
        name:"Samsung Galaxy SIII Blau 16 GB",
        image:"/images/m1057472_300599.jpg",
        deliveryTime:"Lieferbar innerhalb 48 Stunden",
        sku1444275:{p:"prod1334441",e:"49.90"}, // "Vodafone Red S"
        sku1444283:{p:"prod1334441",e:"9.90"},  // "Vodafone Red M"
        sku1444291:{p:"prod1334441",e:"9.90"},  // "Vodafone Red Premium"
        sku1444286:{p:"prod1334441",e:"9.90"},  // "Vodafone Red L"
        sku1104261:{p:"prod1334441",e:"99.90"}  // "Vodafone Basic 100"
    },
    // . . .
}

I've added comments to show the plan names.
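A listing like the one above is a JavaScript object literal, not JSON: the keys are unquoted, so `json.loads` will reject it as it stands. One possible workaround, sketched here under assumptions, is to quote the bare keys and drop `//` comments and trailing commas first. The function name `js_object_to_dict` is hypothetical, and the sketch assumes double-quoted strings only (no single quotes, block comments, or other JavaScript expressions).

```python
import json
import re

# A double-quoted string, matched whole so its contents are never rewritten.
_STRING = r'"(?:\\.|[^"\\])*"'
# String | line comment | bare key (identifier or integer followed by a colon).
_TOKEN = re.compile(_STRING + r'|//[^\n]*|([A-Za-z_$][\w$]*|\d+)(?=\s*:)')
# String | comma that is directly followed by a closing brace or bracket.
_TRAILING_COMMA = re.compile(_STRING + r'|,(?=\s*[}\]])')


def js_object_to_dict(literal):
    """Convert a simple JS object literal (text) into a Python dict."""

    def fix_token(match):
        text = match.group(0)
        if text.startswith('"'):
            return text          # keep strings untouched
        if text.startswith("//"):
            return ""            # drop line comments
        return '"%s"' % text     # quote bare keys

    def drop_comma(match):
        text = match.group(0)
        return text if text.startswith('"') else ""

    quoted = _TOKEN.sub(fix_token, literal)
    cleaned = _TRAILING_COMMA.sub(drop_comma, quoted)
    return json.loads(cleaned)
```

All values come back as strings, exactly as they appear in the page, so prices still need converting to a numeric type.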

Here we see Detail Item 2.
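Once window.phones is available as a Python dict, the one-time payment per plan can be read off the nested sub-SKU entries. This is a small sketch using the example values from the listing above; the function name `one_time_payments` is hypothetical, and reading the `e` field as the one-time payment is inferred from the example (9.90 for "Red M").

```python
from decimal import Decimal

# Sample data copied from the window.phones listing (already parsed).
phones = {
    "sku1224225": {
        "name": "Samsung Galaxy SIII Blau 16 GB",
        "image": "/images/m1057472_300599.jpg",
        "deliveryTime": "Lieferbar innerhalb 48 Stunden",
        "sku1444275": {"p": "prod1334441", "e": "49.90"},  # "Vodafone Red S"
        "sku1444283": {"p": "prod1334441", "e": "9.90"},   # "Vodafone Red M"
    },
}


def one_time_payments(phone):
    """Map plan sub-SKU -> one-time payment for a single phone entry.

    Only nested dicts carrying an "e" field are treated as plan entries;
    plain string fields such as "name" are skipped.
    """
    return {
        key: Decimal(value["e"])
        for key, value in phone.items()
        if isinstance(value, dict) and "e" in value
    }


payments = one_time_payments(phones["sku1224225"])
```

`Decimal` is used instead of `float` so that prices such as 9.90 keep their exact value.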

The SKUs listed here are plan sub-SKUs defined in window.rates. For "Red M" we have:

window.rates = {
    sku1444279:{
        label:"Vodafone Red M",
        propId:"prod1564453",
        subsku:{
            sku1444283:{    // "Samsung Galaxy SIII Blau 16 GB", etc.
                monthlyChargest:"59.99",
                activationCharge:"29.99",
                discounts:[
                    "sku140988",    // "Ich bin 18-25 Jahre jung" (-10)
                    "sku140989",    // "Ich habe einen Schwerbehindertenausweis" (-10)
                    "sku140990"     // "Ich bin Student und jünger als 30" (-10)
                ],
                promotions:["27"],  // "24 x 5 Euro Smartphone-Rabatt" (-5)
                Goodies:[
                    "prod1674486"   // "24 x 10 % Rabatt" (-6)
                ]
            },
            // more subskus here . . .
        }
    },
    // . . .
}

Again, I've added comments for the linked data. Note that many devices can link to the same subsku.

We see Detail Items 1 & 3 and links to Items 4, 5, and 6.
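To get from a sub-SKU found in window.phones back to its plan, the rates structure can be searched for the plan whose `subsku` mapping contains that key. The sketch below uses the example data; `find_plan` is a hypothetical name, and the key `monthlyChargest` is copied verbatim from the listing, so it is worth checking its spelling against the live page.

```python
# Sample data copied from the window.rates listing (already parsed).
rates = {
    "sku1444279": {
        "label": "Vodafone Red M",
        "propId": "prod1564453",
        "subsku": {
            "sku1444283": {
                "monthlyChargest": "59.99",  # spelling as in the listing
                "activationCharge": "29.99",
                "discounts": ["sku140988", "sku140989", "sku140990"],
                "promotions": ["27"],
                "Goodies": ["prod1674486"],
            },
        },
    },
}


def find_plan(rates, subsku_id):
    """Return (plan_sku, plan_label, subsku_dict) for a sub-SKU, or None."""
    for plan_sku, plan in rates.items():
        subsku = plan.get("subsku", {}).get(subsku_id)
        if subsku is not None:
            return plan_sku, plan["label"], subsku
    return None


plan_sku, label, subsku = find_plan(rates, "sku1444283")
```

The returned plan SKU (the "major" SKU) is the key later needed to pick the right amount in window.discounts.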

Goodies links to window.goodies via prod number:

window.goodies = {
    prod1674486:{
        SkuId:"prod1674486",
        Name:"24 x 10 % Rabatt",
        Value:"-6",
        Type:"absolute",
        DurationInMonth:"24"
    },
    // . . .
}

Which gives us Detail Item 4.

window.rates also links to window.promotions via the subsku's promotions list:

window.promotions = {
    27:{
        promotionId:"27",
        promotionName:"24 x 5 Euro Smartphone-Rabatt",
        promotionValue:"-5",
        Type:"absolute",
        duration_in_months:"24",
        deeplinkParameter:""
    },
    // . . .
}

Which gives us Detail Item 5.
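Resolving the `Goodies` and `promotions` ID lists of a subsku into named amounts is a pair of dictionary lookups. A minimal sketch with the example data follows; the function name `monthly_adjustments` is hypothetical, and it assumes every listed ID is present in the corresponding object.

```python
from decimal import Decimal

# Sample data copied from the listings above (already parsed).
goodies = {
    "prod1674486": {"SkuId": "prod1674486", "Name": "24 x 10 % Rabatt",
                    "Value": "-6", "Type": "absolute", "DurationInMonth": "24"},
}
promotions = {
    "27": {"promotionId": "27", "promotionName": "24 x 5 Euro Smartphone-Rabatt",
           "promotionValue": "-5", "Type": "absolute",
           "duration_in_months": "24", "deeplinkParameter": ""},
}
subsku = {"promotions": ["27"], "Goodies": ["prod1674486"]}


def monthly_adjustments(subsku, goodies, promotions):
    """Return a list of (name, amount) pairs for a subsku's goodies and promotions."""
    items = []
    for goodie_id in subsku.get("Goodies", []):
        goodie = goodies[goodie_id]
        items.append((goodie["Name"], Decimal(goodie["Value"])))
    for promo_id in subsku.get("promotions", []):
        promo = promotions[promo_id]
        items.append((promo["promotionName"], Decimal(promo["promotionValue"])))
    return items


adjustments = monthly_adjustments(subsku, goodies, promotions)
```

Note that the two objects use different field names for the same idea (`Name`/`Value` versus `promotionName`/`promotionValue`), which is why they are handled in separate loops.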

window.discounts contains the special discounts for Detail Item 6:

window.discounts = {
    sku140988:{
        SkuId:"sku140988",
        Name:"Ich bin 18-25 Jahre jung",
        Type:"absolute",
        DurationInMonth:"24",
        Value:{
            sku1444295:"-10",   // "Vodafone Red Premium"
            sku1444279:"-10",   // "Vodafone Red M"
            sku1444290:"-20"}   // "Vodafone Red L"
    },
    sku140989:{
        SkuId:"sku140989",
        Name:"Ich habe einen Schwerbehindertenausweis",
        Type:"absolute",
        DurationInMonth:"24",
        Value:{
            sku1444295:"-10",   // "Vodafone Red Premium"
            sku1444279:"-10",   // "Vodafone Red M"
            sku1444290:"-20"}   // "Vodafone Red L"
    },
    sku140990:{
        SkuId:"sku140990",
        Name:"Ich bin Student und jünger als 30",
        Type:"absolute",
        DurationInMonth:"24",
        Value:{
            sku1444295:"-10",   // "Vodafone Red Premium"
            sku1444279:"-10",   // "Vodafone Red M"
            sku1444290:"-20"}   // "Vodafone Red L"
    }
};

The proper discount amount is selected by plan major SKU (via the SKUs listed under Value).
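Selecting the discount amount by major SKU could look like the following sketch. The name `discount_for_plan` is hypothetical; returning `None` for a plan that is not listed is a choice made here, since the listing does not say how such plans are treated.

```python
from decimal import Decimal

# One entry copied from the window.discounts listing (already parsed).
discounts = {
    "sku140988": {
        "SkuId": "sku140988",
        "Name": "Ich bin 18-25 Jahre jung",
        "Type": "absolute",
        "DurationInMonth": "24",
        "Value": {
            "sku1444295": "-10",  # "Vodafone Red Premium"
            "sku1444279": "-10",  # "Vodafone Red M"
            "sku1444290": "-20",  # "Vodafone Red L"
        },
    },
}


def discount_for_plan(discounts, discount_id, plan_sku):
    """Return the discount amount for a plan's major SKU, or None if not listed."""
    value = discounts[discount_id]["Value"].get(plan_sku)
    return None if value is None else Decimal(value)


red_m_discount = discount_for_plan(discounts, "sku140988", "sku1444279")
```

The `plan_sku` argument is the top-level key from window.rates (for example `sku1444279` for "Vodafone Red M"), not the sub-SKU found in window.phones.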

And that's it. Just parse these 5 objects into Python objects and you'll have all the data you need.
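As a rough illustration of how the parsed objects fit together, the sketch below builds one row per device and plan with the four fields asked for in the question. The function name `build_rows` and the row keys are hypothetical. It also assumes the net monthly price is the monthly charge plus all goodie and promotion amounts (treated as absolute euro values), and it leaves out the optional discounts from window.discounts; the page itself shows these items separately, so the summing rule is an assumption.

```python
from decimal import Decimal


def build_rows(phones, rates, goodies, promotions):
    """Return a list of dicts: plan, device, one_time, monthly."""
    # Index every sub-SKU by ID so phone entries can be matched directly.
    subsku_index = {}
    for plan in rates.values():
        for subsku_id, subsku in plan.get("subsku", {}).items():
            subsku_index[subsku_id] = (plan["label"], subsku)

    rows = []
    for phone in phones.values():
        for key, value in phone.items():
            if key not in subsku_index:
                continue  # not a plan entry, or plan not present in rates
            label, subsku = subsku_index[key]
            monthly = Decimal(subsku["monthlyChargest"])  # key as in the listing
            monthly += sum(Decimal(goodies[g]["Value"])
                           for g in subsku.get("Goodies", []))
            monthly += sum(Decimal(promotions[p]["promotionValue"])
                           for p in subsku.get("promotions", []))
            rows.append({"plan": label, "device": phone["name"],
                         "one_time": Decimal(value["e"]), "monthly": monthly})
    return rows


# Example values taken from the listings above.
phones = {"sku1224225": {"name": "Samsung Galaxy SIII Blau 16 GB",
                         "sku1444283": {"p": "prod1334441", "e": "9.90"}}}
rates = {"sku1444279": {"label": "Vodafone Red M", "subsku": {"sku1444283": {
    "monthlyChargest": "59.99", "activationCharge": "29.99",
    "promotions": ["27"], "Goodies": ["prod1674486"]}}}}
goodies = {"prod1674486": {"Name": "24 x 10 % Rabatt", "Value": "-6"}}
promotions = {"27": {"promotionName": "24 x 5 Euro Smartphone-Rabatt",
                     "promotionValue": "-5"}}

rows = build_rows(phones, rates, goodies, promotions)
```

For the running example this yields a single row for "Vodafone Red M" with the Samsung Galaxy SIII, a one-time payment of 9.90 and a monthly amount of 59.99 - 6 - 5 = 48.99.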
