Can't fetch some content from a webpage using post requests

I've created a Python script that uses Selenium to scrape some content located in a box-like container in the left sidebar of a webpage. With Selenium I can get it without any trouble. Now I would like to get the same content using the requests module. Experimenting in dev tools, I noticed that a POST request is sent which produces a JSON response (pasted below). However, at this point I'm stuck as to how to fetch the content using requests.

webpage link

Selenium approach:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_content(link):
    driver.get(link)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#tab-outline"))).click()
    for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#pageoutline > [class^='outline_H']"))):
        print(item.text)

if __name__ == '__main__':
    url = "http://wave.webaim.org/report#/www.onewerx.com"
    with webdriver.Chrome() as driver:
        wait = WebDriverWait(driver,10)
        get_content(url)

Partial output the script produces (as desired):

Marketing Mix Modeling
Programmatic & Modeling
Programmatic is buying digital advertising space automatically, with computers using data to decide which ads to buy and how much to pay for them.
Modern
Efficient
Scalable
Resultative
What is Modeling?
Modeling is an analytical approach that uses historic information, such as syndicated point-of-sale data and companies’ internal data, to quantify the sales impact of various marketing activities.
Programmatic - future of the marketing

When trying with requests:

import requests

url = "http://wave.webaim.org/data/request.php"

headers = {
    'Referer': 'http://wave.webaim.org/report',
    'X-Requested-With': 'XMLHttpRequest'
}

res = requests.post(url,data={'source':'http://www.onewerx.com'},headers=headers)
print(res.json())

I get the following output:

{'success': True, 'reportkey': '6520439253ac21885007b52c677b8078', 'contenttype': 'text/html; charset=UTF-8'}

How can I get the same content using requests?

To be clearer: this is what I'm interested in.

The output above looks different from the image because the Selenium script clicks the following button attached to that box to expand the content:

[image]

Ok, I've done a bit of reverse engineering.
It seems like the whole process runs on the client side. Here's how:

wave.engine.statistics contains the result you're looking for:

// wave.min.js

wave.fn.applyRules = function() {
    var e = {};
    e.statistics = {};
    try {
        e.categories = wave.engine.run(),
        e.statistics = wave.engine.statistics;
        wave.engine.ruleTimes;
        e.statistics.pagetitle = wave.page.title,
        e.statistics.totalelements = wave.allTags.length,
        e.success = !0
    } catch (t) {
        console.log(t)
    }
    return e
}

Here the wave.engine.run function runs all the rules on the client side (s is the <body> element):

[image: rules]

and returns the results:

wave.engine.run = function(e) {
    var t = new Date
      , n = null
      , i = null
      , a = new Date;
    wave.engine.fn.calculateContrast(this.fn.getBody());
    var o = new Date
      , r = wave.rules
      , s = $(wave.page);
    if (e)
        r[e] && r[e](s);
    else
        for (e in r) {
            n = new Date;
            try {
                r[e](s)
            } catch (l) {
                console.log("RULE FAILURE(" + e + "): " + l.stack)
            }
            i = new Date,
            this.ruleTimes[e] = i - n,
            config.debug && console.log("RULE: " + e + " (" + this.ruleTimes[e] + "ms)")
        }
    return EndTimer = new Date,
    config.debug && console.log("TOTAL RULE TIME: " + (EndTimer - t) + "ms"),
    a = new Date,
    wave.engine.fn.structureOutput(),
    o = new Date,
    wave.engine.results
}

So you have two options: port these rules into Python, or keep using Selenium.

wave.rules = {},
wave.rules.text_justified = function(e) {
    e.find("p, div, td").each(function(t, n) {
        var i = e.find(n);
        "justify" == i.css("text-align") && wave.engine.fn.addIcon(n, "text_justified")
    })
}
,
wave.rules.alt_missing = function(e) {
    wave.engine.fn.overrideby("alt_missing", ["alt_link_missing", "alt_map_missing", "alt_spacer_missing"]),
    e.find("img:not([alt])").each(function(e, t) {
        var n = $(t);
        void 0 != n.attr("title") && 0 != n.attr("title").length || wave.engine.fn.addIcon(t, "alt_missing")
    })
}
// ... and many more
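
To give a feel for what porting a rule could involve, here is a minimal sketch of the alt_missing rule in Python, using only the standard library's html.parser. The names AltMissingChecker and find_alt_missing are made up for this illustration, and the sketch skips the overrideby step entirely. It also works on static HTML only, so it would not cover rules such as text_justified that depend on computed CSS in a rendered page.

```python
from html.parser import HTMLParser


class AltMissingChecker(HTMLParser):
    """Hypothetical port of wave.rules.alt_missing for static HTML."""

    def __init__(self):
        super().__init__()
        self.flagged = []  # (line, column) of each offending <img>

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Mirrors the "img:not([alt])" selector: any alt attribute, even an
        # empty one, takes the image out of this rule.
        if "alt" in attr_map:
            return
        # Mirrors the title check: a non-empty title suppresses the icon.
        if attr_map.get("title"):
            return
        self.flagged.append(self.getpos())


def find_alt_missing(html):
    """Return the positions of <img> tags with no alt and no usable title."""
    checker = AltMissingChecker()
    checker.feed(html)
    checker.close()
    return checker.flagged


if __name__ == "__main__":
    sample = (
        '<img src="a.png">'
        '<img src="b.png" alt="">'
        '<img src="c.png" title="Logo">'
        '<img src="d.png" title="">'
    )
    # a.png and d.png are flagged; b.png has alt, c.png has a title.
    print(find_alt_missing(sample))
```

Each rule would need a similar hand-written port, and the HTML would still have to be the rendered DOM for the results to match WAVE.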

Since the tests rely on the browser engine to render the page fully (unfortunately, reports are not generated in the cloud), you have to use Selenium for this job.
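
If you stay with Selenium, a sketch along these lines returns the outline as a list instead of printing it, and runs Chrome headless. It reuses the selectors from the question. The helper names build_report_url and get_outline are hypothetical, and stripping the scheme from the site address is an assumption based on the report URL shown in the question.

```python
def build_report_url(site):
    """Build the WAVE report URL for a site given with or without a scheme.

    Assumption: the report page expects the bare host after "#/", as in
    the URL from the question.
    """
    host = site.split("://", 1)[-1]
    return "http://wave.webaim.org/report#/" + host


def get_outline(site, timeout=10):
    """Open the WAVE report in headless Chrome and return the outline texts."""
    # Imported here so the URL helper can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # no visible browser window

    with webdriver.Chrome(options=options) as driver:
        wait = WebDriverWait(driver, timeout)
        driver.get(build_report_url(site))
        # Open the outline tab, as the script in the question does.
        wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "#tab-outline"))
        ).click()
        items = wait.until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, "#pageoutline > [class^='outline_H']")
            )
        )
        return [item.text for item in items]


if __name__ == "__main__":
    for line in get_outline("www.onewerx.com"):
        print(line)
```

Returning a list keeps the scraping separate from whatever you do with the result, such as writing it to a file.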
