
How to scrape information from a textbook buyback website?

I am writing a program, and one part of it needs to find the best buyback price for a textbook. I am trying to scrape that value from https://bookscouter.com. For example, https://bookscouter.com/prices.php?isbn=1285428226&searchbutton=Sell shows a value of $34. The problem is that the website is definitely not static, so simple Python scraping doesn't really work. How would I go about this? Some sort of request? I am not very experienced with web work, so any advice would be appreciated.

This page uses Ajax to fetch additional information. The source code of https://bookscouter.com/prices.php?isbn=1285428226&searchbutton=Sell shows:

<script language="javascript" type="text/javascript">
    function fetchresults_cb(search_id, text) {
        replaceContent('price_results', text);
        if(text.match(/INCOMPLETE/i)) {
            currentTime = new Date();
            time = currentTime.getTime();
            delayfunc = "AjaxRetrieve('/ajax_prices.php?type=PREFERRED&isbn=1285428226&search_id="+search_id+"&ts="+time+"', 'fetchresults_cb(\\'"+search_id+"\\', THISREQ.responseText)', 'true');";
            setTimeout(delayfunc, 3000);
        }
    }
</script>

There are two ways to scrape this kind of page.

The first way is to re-implement the source code above in Python and fetch the additional resources yourself, as a browser does while executing the JavaScript. You can analyze the full source code of the page or use the browser's network monitor to identify the URL where the required information is available.
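As a rough sketch of this first approach, the polling loop from the JavaScript can be mirrored in Python. The endpoint path, the query parameters, the 3-second delay and the INCOMPLETE check come from the script shown earlier. Everything else is an assumption: the function names are hypothetical, the User-Agent header is a guess, and the `search_id` value has to be found by inspecting the page source or the network monitor, since the snippet does not show where it comes from.

```python
import re
import time
import urllib.parse
import urllib.request

BASE_URL = "https://bookscouter.com"


def build_ajax_url(isbn, search_id, ts_ms):
    """Build the URL that the page's JavaScript passes to AjaxRetrieve."""
    query = urllib.parse.urlencode({
        "type": "PREFERRED",
        "isbn": isbn,
        "search_id": search_id,
        "ts": ts_ms,  # millisecond timestamp, like Date.getTime()
    })
    return f"{BASE_URL}/ajax_prices.php?{query}"


def http_get(url):
    """Plain HTTP GET; the User-Agent header is an assumption."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def poll_prices(isbn, search_id, fetch=http_get, delay=3.0,
                max_tries=10, sleep=time.sleep):
    """Request the results until the response no longer says INCOMPLETE."""
    for _ in range(max_tries):
        url = build_ajax_url(isbn, search_id, int(time.time() * 1000))
        text = fetch(url)
        if not re.search(r"INCOMPLETE", text, re.IGNORECASE):
            return text  # HTML fragment the page puts into 'price_results'
        sleep(delay)  # the page waits 3000 ms between requests
    raise TimeoutError("results were still incomplete after polling")
```

The returned text is the HTML fragment that the page would insert into the `price_results` element, so it still needs to be parsed (for example with an HTML parser) to pull out the individual prices. The `fetch` and `sleep` parameters are there only so the loop can be exercised without touching the network.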

The second way is to use Selenium, which drives a real browser engine to execute the JavaScript and gives you the fully rendered page source with all the required information.
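A minimal sketch of the Selenium approach might look like the following. It assumes the `selenium` package and a Chrome driver are installed, that the results land in the element with id `price_results` (the id used by `replaceContent` in the page's script), and that the INCOMPLETE marker disappears from that element once loading has finished. The function names and the 60-second timeout are hypothetical choices, not anything the site documents.

```python
import re

PAGE_URL = "https://bookscouter.com/prices.php?isbn={isbn}&searchbutton=Sell"


def results_ready(html):
    """True once the results block is non-empty and no longer INCOMPLETE."""
    return bool(html.strip()) and not re.search(
        r"INCOMPLETE", html, re.IGNORECASE
    )


def fetch_rendered_results(isbn, timeout=60):
    """Load the page in a real browser and return the finished results HTML."""
    # Imported here so results_ready() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(PAGE_URL.format(isbn=isbn))

        def ready(d):
            element = d.find_element(By.ID, "price_results")
            html = element.get_attribute("innerHTML") or ""
            # WebDriverWait keeps polling while this returns a falsy value.
            return html if results_ready(html) else False

        return WebDriverWait(driver, timeout).until(ready)
    finally:
        driver.quit()  # always close the browser, even on timeout
```

Calling `fetch_rendered_results("1285428226")` should then return the HTML of the results block, which can be parsed for prices in the same way as a static page. This is slower than calling the Ajax endpoint directly, but it does not require working out how the page obtains its `search_id`.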

I assume you have permission from the owner of bookscouter.com to perform this kind of activity.

