
How to scrape the output of widgets on a website using Python/Scrapy?

I am trying to scrape the ads on a site. Take this site, for example:

http://www.bestyling.com/15-of-the-most-expensive-shoes-ever-and-you-wont-believe-whats-1/?utm_source=Ourbrain&utm_medium=cpc&utm_campaign=15%20Shoes%20-%20Desktop%20USA

I am trying to get the ads from this XPath:

/html/body[@class='single single-post postid-171 single-format-standard custom-background hasGoogleVoiceExt']/div[@id='site']/div[@id='site-out']/div[@id='site-fixed']/div[@id='content-out']/div[@id='content-in']/div[@id='main-content-wrap']/div[@id='main-content-contain']/div[@id='content-wrap']/div[@class='sec-marg-out4 relative']/div[@class='sec-marg-in4']/article[@class='post-171 post type-post status-publish format-standard hentry category-uncategorized']/div[@id='post-area']/div[@class='post-body-out']/div[@class='post-body-in']/div[@id='content-area']/div[@class='content-area-cont left relative']/div[@class='sec-marg-out relative']/div[@class='sec-marg-in']/div[@class='content-area-out']/div[@class='content-area-in']/div[@class='content-main left relative']/div[@id='article-ad']/div[1]/div[@id='ac_110238']/div[@class='ac_adbox']/div[@class='ac_adbox_inner']

(the 'ac_container' or 'ac-adbox' elements)

When I open the page in a browser I see the ad, but when I use Scrapy to get the HTML, all I get is a script:

  <div id="contentad110238"></div>
  <script type="text/javascript">
      (function(d) {
          var params = {
              id: "d12cd6f3-b896-443b-9140-07e35e66e222",
              d:  "YmVzdHlsaW5nLmNvbQ==",
              wid: "110238",
              cb: (new Date()).getTime()
          };

          var qs = [];
          for (var key in params) qs.push(key + '=' + encodeURIComponent(params[key]));
          var s = d.createElement('script'); s.type = 'text/javascript'; s.async = true;
          var p = 'https:' == document.location.protocol ? 'https' : 'http';
          s.src = p + "://api.content.ad/Scripts/widget2.aspx?" + qs.join('&');
          d.getElementById("contentad110238").appendChild(s);
      })(document);
  </script>
  </div>

How do I scrape this? Any help would be appreciated. I'm guessing I have to use a JS renderer in Python or Scrapy. Any recommendations?

Those ads are fetched via JavaScript, so when you download the raw HTML (as Scrapy does) you won't see them.
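To make this concrete, here is a rough sketch of what the inline script in the question does, reproduced in plain Python: it reads the `params` object out of the raw HTML and builds the `widget2.aspx` URL that the browser would request. The helper name `build_widget_url` and the abridged `RAW_HTML` sample are only for illustration, and what that endpoint actually returns (and whether it is easy to parse) is not something the question shows, so treat this as a starting point for inspection rather than a working ad scraper.

```python
import base64
import re
import time
from urllib.parse import urlencode

# Inline script as it appears in the raw HTML (abridged from the question).
RAW_HTML = '''
<div id="contentad110238"></div>
<script type="text/javascript">
    (function(d) {
        var params = {
            id: "d12cd6f3-b896-443b-9140-07e35e66e222",
            d:  "YmVzdHlsaW5nLmNvbQ==",
            wid: "110238",
            cb: (new Date()).getTime()
        };
    })(document);
</script>
'''


def build_widget_url(html, scheme="http"):
    """Rebuild the widget script URL from the inline params object (hypothetical helper)."""
    # Restrict the search to the body of the "var params = { ... }" literal.
    block = re.search(r"var\s+params\s*=\s*\{(.*?)\}", html, re.S)
    if block is None:
        raise ValueError("no params object found in the HTML")
    # Collect the quoted key/value pairs: id, d and wid.
    params = dict(re.findall(r'(\w+):\s*"([^"]*)"', block.group(1)))
    # cb is a cache-busting timestamp in milliseconds, like (new Date()).getTime().
    params["cb"] = str(int(time.time() * 1000))
    return scheme + "://api.content.ad/Scripts/widget2.aspx?" + urlencode(params)


if __name__ == "__main__":
    print(build_widget_url(RAW_HTML))
    # The "d" parameter is simply the site's domain, base64-encoded.
    print(base64.b64decode("YmVzdHlsaW5nLmNvbQ==").decode())
```

In a Scrapy callback you could pass `response.text` to such a helper and yield a second request for the resulting URL, but the response will most likely be more JavaScript, which is why a rendering approach is usually simpler.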

However, you can take a look at Splash, a JavaScript rendering service made by the Scrapy developers. Its Scrapy integration, scrapy-splash (formerly ScrapyJS), lets your spider fetch pages through a real browser engine so that the JavaScript gets executed.

Everything is in Python, except for Qt, which is used for the browser rendering.
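As a minimal sketch of what using Splash looks like, the snippet below talks to its `render.html` HTTP endpoint directly with the standard library. It assumes a Splash instance is already running locally on its default port 8050; the function names and the `wait` value of 2 seconds are arbitrary choices for illustration, not something from the question.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Splash instance is listening locally on its default port.
SPLASH_ENDPOINT = "http://localhost:8050/render.html"


def splash_render_url(page_url, wait=2.0, endpoint=SPLASH_ENDPOINT):
    """Build the render.html request URL for a page (hypothetical helper)."""
    # "url" is the page to render; "wait" gives scripts time to run after load.
    query = urlencode({"url": page_url, "wait": wait})
    return endpoint + "?" + query


def fetch_rendered_html(page_url, wait=2.0):
    """Return the page HTML after Splash has executed its JavaScript."""
    with urlopen(splash_render_url(page_url, wait), timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Requires a running Splash instance; otherwise urlopen raises URLError.
    html = fetch_rendered_html("http://www.bestyling.com/")
    print("ac_adbox" in html)
```

Inside a Scrapy project, scrapy-splash wraps the same idea: you yield a `SplashRequest` instead of a plain `Request` and parse the rendered response as usual. If the widget ends up inside an iframe, the ad markup may still be missing from the rendered HTML and need a separate request.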
