Issue with extracting data from a JavaScript structure on a website using Beautiful Soup in Python

I am trying to scrape data from a website that uses a JavaScript structure to load its data. To do that, I used the solution to the question "Issue with html tags while scraping data using beautiful soup". After getting the JSON data dictionary, I iterated over it and successfully got the device name and price data.

The code in that solution extracts data from a window that shows the device name and price; in the code, the corresponding attribute is window.rates.

Problem: If you look at the structure of the website, it has 3 parts.

  1. The 1st part contains a window with the plan name and its other details
  2. The 2nd part contains a window with the device name and price (this is the window I am currently scraping data from)
  3. The 3rd part contains the plan name, device name, price, and monthly prices

I want to extract data from the 3rd part because I want all 4 fields (plan name, device name, price, monthly price). I am already able to scrape data from the 1st and 2nd parts using the solution to the question mentioned above.

However, I cannot find the JavaScript that loads the data in the 3rd part, nor the attribute (e.g. window.rates for the 2nd part) that I would have to use to get the JSON dictionary of data for the 3rd part.

Also, the data in the 3rd part of the website changes as we scroll the windows in the 2nd part.

PS: I tried printing all the scripts running on the page to find the one that loads the data in the 3rd part, but that did not help.

Please help me solve this issue.

You provided a link to your previous question, which mentions the site you're interested in:

http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html

You just have to look at the page's code.

Say you select "Red M" as the plan and "Samsung Galaxy SIII Blau (Blue) / 16 GB" as the device. The bottom section will display:

Detail Items

Einmalige Kosten (One-time costs)

  1. Anschlusspreis (Activation Charge): 29.99
  2. Einmalzahlung (One-time Payment) Smartphone: 9.90

    Monatliche Kosten (Monthly Charges)

  3. Red M: 59.99

  4. 24 x 10 % Rabatt (discount): -6.00
  5. 24 x 5 Euro Smartphone-Rabatt: -5.00

  6. Also, one of three 10.00/month discounts is available for being a student, young, or handicapped.

You need to parse (maybe using Python's json module) these JavaScript assignments:

window.phones
window.rates
window.discounts
window.goodies
window.promotions
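These assignments are JavaScript object literals rather than strict JSON (bare keys, and possibly comments or trailing commas), so json.loads may reject them as they stand. Here is a minimal sketch of one way to convert them. It assumes the page source is already in a string, and that no string value contains braces, "//", or text that looks like ", key:". The helper names extract_assignment and js_literal_to_dict are hypothetical, and the sample text is a trimmed stand-in for the real page.

```python
import json
import re


def extract_assignment(script_text, name):
    """Return the {...} literal assigned to window.<name>, found by brace counting."""
    match = re.search(r"window\.%s\s*=\s*\{" % re.escape(name), script_text)
    if match is None:
        raise ValueError("no assignment to window.%s found" % name)
    open_pos = match.end() - 1  # position of the opening brace
    depth = 0
    for pos in range(open_pos, len(script_text)):
        if script_text[pos] == "{":
            depth += 1
        elif script_text[pos] == "}":
            depth -= 1
            if depth == 0:
                return script_text[open_pos:pos + 1]
    raise ValueError("unbalanced braces after window.%s" % name)


def js_literal_to_dict(literal):
    """Rewrite a simple JS object literal as JSON and parse it."""
    # Drop // comments (assumption: no string value contains "//").
    text = re.sub(r"//[^\n]*", "", literal)
    # Quote bare keys: an identifier or number right after "{" or ",".
    text = re.sub(r"([{,]\s*)([A-Za-z_$][\w$]*|\d+)\s*:", r'\1"\2":', text)
    # Remove trailing commas, which JSON does not allow.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)


# Trimmed sample standing in for the downloaded page source.
sample = """
window.promotions = {
    27:{
        promotionId:"27",
        promotionName:"24 x 5 Euro Smartphone-Rabatt",
        promotionValue:"-5"
    },
    // . . .
}
"""

promotions = js_literal_to_dict(extract_assignment(sample, "promotions"))
print(promotions["27"]["promotionName"])
```

Regular expressions are a shortcut here; if the real literals are more irregular than the sample, a proper JavaScript-aware parser would be the safer choice.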

I'll walk you through the data structures. You'll have to write the code yourself.

window.phones contains this entry (keeping with our example):

window.phones = {
    sku1224225:{
        name:"Samsung Galaxy SIII Blau 16 GB",
        image:"/images/m1057472_300599.jpg",
        deliveryTime:"Lieferbar innerhalb 48 Stunden",
        sku1444275:{p:"prod1334441",e:"49.90"}, // "Vodafone Red S"
        sku1444283:{p:"prod1334441",e:"9.90"},  // "Vodafone Red M"
        sku1444291:{p:"prod1334441",e:"9.90"},  // "Vodafone Red Premium"
        sku1444286:{p:"prod1334441",e:"9.90"},  // "Vodafone Red L"
        sku1104261:{p:"prod1334441",e:"99.90"}  // "Vodafone Basic 100"
    },
    // . . .
}

I've added comments to show the plan names.

Here we see Detail Item 2.
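As a rough sketch of that lookup, assume window.phones has already been parsed into a Python dict (the literal below is a trimmed copy of the entry above). Each device entry mixes descriptive fields with per-plan offers, so the hypothetical helpers below treat only dict-valued keys as plan sub-SKUs. Reading "e" as the one-time payment is inferred from the 9.90 shown for Red M.

```python
# Trimmed copy of the window.phones entry, written as an already-parsed dict.
phones = {
    "sku1224225": {
        "name": "Samsung Galaxy SIII Blau 16 GB",
        "image": "/images/m1057472_300599.jpg",
        "deliveryTime": "Lieferbar innerhalb 48 Stunden",
        "sku1444275": {"p": "prod1334441", "e": "49.90"},  # "Vodafone Red S"
        "sku1444283": {"p": "prod1334441", "e": "9.90"},   # "Vodafone Red M"
    },
}


def plan_subskus(device):
    """List the keys of a device entry whose values are per-plan offers."""
    return [key for key, value in device.items() if isinstance(value, dict)]


def one_time_payment(phones, device_sku, plan_subsku):
    """Return the one-time device price (the "e" field) as a string, or None."""
    offer = phones.get(device_sku, {}).get(plan_subsku)
    if isinstance(offer, dict):
        return offer.get("e")
    return None


print(plan_subskus(phones["sku1224225"]))
print(one_time_payment(phones, "sku1224225", "sku1444283"))
```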

The SKUs listed inside each window.phones entry are plan sub-SKUs defined in window.rates. For "Red M" we have:

window.rates = {
    sku1444279:{
        label:"Vodafone Red M",
        propId:"prod1564453",
        subsku:{
            sku1444283:{    // "Samsung Galaxy SIII Blau 16 GB", etc.
                monthlyChargest:"59.99",
                activationCharge:"29.99",
                discounts:[
                    "sku140988",    // "Ich bin 18-25 Jahre jung" (-10)
                    "sku140989",    // "Ich habe einen Schwerbehindertenausweis" (-10)
                    "sku140990"     // "Ich bin Student und jünger als 30" (-10)
                ],
                promotions:["27"],  // "24 x 5 Euro Smartphone-Rabatt" (-5)
                Goodies:[
                    "prod1674486"   // "24 x 10 % Rabatt" (-6)
                ]
            },
            // more subskus here . . .
        }
    },
    // . . .
}

Again, I've added comments for the linked data. Note that many devices can link to the same subsku.

Here we see Detail Items 1 and 3, plus links to Items 4, 5, and 6.
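Because a device only knows the sub-SKU, it can help to build a reverse index from sub-SKU to its plan. The following is a sketch under the assumption that window.rates has been parsed into a dict shaped like the excerpt above; index_subskus is a hypothetical helper name, and the key spellings (including monthlyChargest) are copied as shown.

```python
# Trimmed copy of the window.rates entry, written as an already-parsed dict.
rates = {
    "sku1444279": {
        "label": "Vodafone Red M",
        "propId": "prod1564453",
        "subsku": {
            "sku1444283": {
                "monthlyChargest": "59.99",
                "activationCharge": "29.99",
                "discounts": ["sku140988", "sku140989", "sku140990"],
                "promotions": ["27"],
                "Goodies": ["prod1674486"],
            },
        },
    },
}


def index_subskus(rates):
    """Map each sub-SKU to (plan major SKU, plan label, sub-SKU terms)."""
    index = {}
    for plan_sku, plan in rates.items():
        for subsku, terms in plan.get("subsku", {}).items():
            index[subsku] = (plan_sku, plan["label"], terms)
    return index


subsku_index = index_subskus(rates)
plan_sku, label, terms = subsku_index["sku1444283"]
print(label, terms["activationCharge"], terms["monthlyChargest"])
```

Keeping the plan major SKU in the index is deliberate: the sub-SKU identifies the charges, while the major SKU is what other structures may key on.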

Goodies links to window.goodies via the prod number:

window.goodies = {
    prod1674486:{
        SkuId:"prod1674486",
        Name:"24 x 10 % Rabatt",
        Value:"-6",
        Type:"absolute",
        DurationInMonth:"24"
    },
    // . . .
}

Which gives us Detail Item 4.

window.rates also links to window.promotions via the subsku's promotions list:

window.promotions = {
    27:{
        promotionId:"27",
        promotionName:"24 x 5 Euro Smartphone-Rabatt",
        promotionValue:"-5",
        Type:"absolute",
        duration_in_months:"24",
        deeplinkParameter:""
    },
    // . . .
}

Which gives us Detail Item 5.
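Goodies and promotions are resolved the same way but use different field names (Name/Value versus promotionName/promotionValue). This small sketch, assuming all three structures are already parsed dicts trimmed to the fields used here, collects both into one list of line items; monthly_adjustments is a hypothetical helper name.

```python
# Trimmed, already-parsed copies of the structures shown above.
terms = {"promotions": ["27"], "Goodies": ["prod1674486"]}  # one rates subsku
goodies = {
    "prod1674486": {"Name": "24 x 10 % Rabatt", "Value": "-6"},
}
promotions = {
    "27": {"promotionName": "24 x 5 Euro Smartphone-Rabatt", "promotionValue": "-5"},
}


def monthly_adjustments(terms, goodies, promotions):
    """Return (name, value) pairs for a sub-SKU's goodies and promotions."""
    items = []
    for prod_id in terms.get("Goodies", []):
        goodie = goodies[prod_id]
        items.append((goodie["Name"], goodie["Value"]))
    for promo_id in terms.get("promotions", []):
        promo = promotions[promo_id]
        items.append((promo["promotionName"], promo["promotionValue"]))
    return items


for name, value in monthly_adjustments(terms, goodies, promotions):
    print(name, value)
```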

window.discounts contains the special discounts for Detail Item 6:

window.discounts = {
    sku140988:{
        SkuId:"sku140988",
        Name:"Ich bin 18-25 Jahre jung",
        Type:"absolute",
        DurationInMonth:"24",
        Value:{
            sku1444295:"-10",   // "Vodafone Red Premium"
            sku1444279:"-10",   // "Vodafone Red M"
            sku1444290:"-20"}   // "Vodafone Red L"
    },
    sku140989:{
        SkuId:"sku140989",
        Name:"Ich habe einen Schwerbehindertenausweis",
        Type:"absolute",
        DurationInMonth:"24",
        Value:{
            sku1444295:"-10",   // "Vodafone Red Premium"
            sku1444279:"-10",   // "Vodafone Red M"
            sku1444290:"-20"}   // "Vodafone Red L"
    },
    sku140990:{
        SkuId:"sku140990",
        Name:"Ich bin Student und jünger als 30",
        Type:"absolute",
        DurationInMonth:"24",
        Value:{
            sku1444295:"-10",   // "Vodafone Red Premium"
            sku1444279:"-10",   // "Vodafone Red M"
            sku1444290:"-20"}   // "Vodafone Red L"
    }
};

The proper discount amount is selected by the plan's major SKU (via the SKUs listed under Value).
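That selection could be sketched as follows, assuming window.discounts is already a parsed dict (trimmed here to two entries) and that the list of discount SKUs comes from the sub-SKU's discounts list in window.rates. The helper name applicable_discounts is hypothetical.

```python
# Trimmed, already-parsed copy of window.discounts.
discounts = {
    "sku140988": {
        "Name": "Ich bin 18-25 Jahre jung",
        "Value": {"sku1444295": "-10", "sku1444279": "-10", "sku1444290": "-20"},
    },
    "sku140990": {
        "Name": "Ich bin Student und jünger als 30",
        "Value": {"sku1444295": "-10", "sku1444279": "-10", "sku1444290": "-20"},
    },
}


def applicable_discounts(discounts, discount_skus, plan_sku):
    """Return (name, amount) for each discount that lists the plan's major SKU."""
    result = []
    for sku in discount_skus:
        entry = discounts[sku]
        amount = entry["Value"].get(plan_sku)
        if amount is not None:  # skip discounts that do not cover this plan
            result.append((entry["Name"], amount))
    return result


# "sku1444279" is the major SKU of "Vodafone Red M".
print(applicable_discounts(discounts, ["sku140988", "sku140990"], "sku1444279"))
```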

And that's it. Just parse these 5 objects into Python objects and you'll have all the data you need.
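To illustrate how the parsed objects might be joined into the four fields asked for (plan name, device name, price, monthly price), here is one possible sketch. It assumes the objects are already Python dicts, trimmed here to the Red M / Galaxy SIII example. build_rows is a hypothetical helper, and leaving the optional Item 6 discounts out of the net figure is a choice made for this example, not something the site prescribes.

```python
from decimal import Decimal

# Trimmed, already-parsed copies of four of the five objects.
phones = {
    "sku1224225": {
        "name": "Samsung Galaxy SIII Blau 16 GB",
        "sku1444283": {"p": "prod1334441", "e": "9.90"},
    },
}
rates = {
    "sku1444279": {
        "label": "Vodafone Red M",
        "subsku": {
            "sku1444283": {
                "monthlyChargest": "59.99",
                "promotions": ["27"],
                "Goodies": ["prod1674486"],
            },
        },
    },
}
goodies = {"prod1674486": {"Name": "24 x 10 % Rabatt", "Value": "-6"}}
promotions = {"27": {"promotionName": "24 x 5 Euro Smartphone-Rabatt",
                     "promotionValue": "-5"}}


def build_rows(phones, rates, goodies, promotions):
    """Return (plan, device, one-time price, monthly charge, net monthly) rows."""
    rows = []
    for plan in rates.values():
        for subsku, terms in plan["subsku"].items():
            monthly = Decimal(terms["monthlyChargest"])
            net = monthly
            net += sum(Decimal(goodies[g]["Value"]) for g in terms.get("Goodies", []))
            net += sum(Decimal(promotions[p]["promotionValue"])
                       for p in terms.get("promotions", []))
            for device in phones.values():
                offer = device.get(subsku)
                if isinstance(offer, dict):  # this device is sold with this sub-SKU
                    rows.append((plan["label"], device["name"],
                                 Decimal(offer["e"]), monthly, net))
    return rows


rows = build_rows(phones, rates, goodies, promotions)
for row in rows:
    print(row)
```

Decimal is used instead of float so that the string prices add up exactly.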
