How to scrape HTML from a javascript webpage using Python

I'm trying to parse the HTML of a page to get data from tags nested inside other tags, but when I prettify the response I get JavaScript instead. How do I get the information out of this JavaScript? How do I turn it into HTML? Is there a better way to get this information? This is my first question, and I apologize if I've made any mistakes. Thank you.

This is my code:

from bs4 import BeautifulSoup as bs
import requests

# `url` holds the address of the page being scraped (not shown in the question)
response = requests.get(url)
soup = bs(response.content, 'html.parser')
print(soup.prettify())

The response is what looks like a byte string of the pre-prettified code, followed by:

<html>
<head>
</head>
<script language="javascript">
var strUrl = window.location.href;


if (strUrl.indexOf("modisoftinc.com") > 0)
    window.location.replace("https://www.modisoftinc.com/home.html");
if (strUrl.indexOf("www.modisoftinc.com") > 0)
    window.location.replace("https://www.modisoftinc.com/home.html");
if (strUrl.indexOf("http://modisoftinc.com") > 0)
    window.location.replace("https://www.modisoftinc.com/home.html");
if (strUrl.indexOf("www.modisoftinc.com") > 0)
    window.location.replace("https://www.modisoftinc.com/home.html");


if (strUrl.indexOf("echecks.modisoftinc.com") > 0)
    window.location.replace("https://echecks.modisoftinc.com/Account/Logon");


if (strUrl.indexOf("pos.modisoftinc.com") > 0)
    window.location.replace("https://pos.modisoftinc.com/Account/Logon");


if (strUrl.indexOf("clock.modisoftinc.com") > 0)
    window.location.replace("https://clock.modisoftinc.com/Account/Logon");


if (strUrl.indexOf("admin11.modisoftinc.com") > 0)
    window.location.replace("https://admin11.modisoftinc.com/Account/Logon");




if (strUrl.indexOf("modisoft.com") > 0)
    window.location.replace("https://www.modisoft.com/home.html");
if (strUrl.indexOf("www.modisoft.com") > 0)
    window.location.replace("https://www.modisoft.com/home.html");
if (strUrl.indexOf("http://modisoft.com") > 0)
    window.location.replace("https://www.modisoft.com/home.html");
if (strUrl.indexOf("www.modisoft.com") > 0)
    window.location.replace("https://www.modisoft.com/home.html");


if (strUrl.indexOf("echecks.modisoft.com") > 0)
    window.location.replace("https://echecks.modisoft.com/Account/Logon");

if (strUrl.indexOf("app.modisoft.com") > 0)
    window.location.replace("https://app.modisoft.com/Account/Logon");

if (strUrl.indexOf("app1.modisoft.com") > 0)
    window.location.replace("https://app1.modisoft.com/Account/Logon");

if (strUrl.indexOf("app2.modisoft.com") > 0)
    window.location.replace("https://app2.modisoft.com/Account/Logon");

if (strUrl.indexOf("pos.modisoft.com") > 0)
    window.location.replace("https://pos.modisoft.com/Account/Logon");

if (strUrl.indexOf("clock.modisoft.com") > 0)
    window.location.replace("https://clock.modisoft.com/Account/Logon");

    if (strUrl.indexOf("admin11.modisoft.com") > 0)
    window.location.replace("https://admin11.modisoft.com/Account/Logon");



if (strUrl.indexOf("modisoftrewards.com") > 0)
    window.location.replace("https://www.modisoftrewards.com/index.html");
if (strUrl.indexOf("www.modisoftrewards.com") > 0)
    window.location.replace("https://www.modisoftrewards.com/index.html");
if (strUrl.indexOf("http://modisoftrewards.com") > 0)
    window.location.replace("https://www.modisoftrewards.com/index.html");
if (strUrl.indexOf("www.modisoftrewards.com") > 0)
    window.location.replace("https://www.modisoftrewards.com/index.html");






   if (strUrl.indexOf("localhost") > 0)
       window.location.replace("Account/Logon");
</script>
<body>
</body>
</html>

How do I get the information out of this javascript? How do I turn it into html?

Yes, you need a browser automation tool (Selenium, headless Chrome) to execute the on-site JS. Once it has run, the JS fills in the HTML with the missing data. E.g.:

  1. https://webscraping.pro/javascript-rendering-library-for-scraping-javascript-sites/

  2. https://webscraping.pro/java-library-to-scrape-linkedin-its-data-affiliates/
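As a rough illustration of the browser-automation route, here is a minimal sketch using Selenium 4 with headless Chrome. The function name `render_page` and its `wait_for_css` parameter are made up for this example, and it assumes Chrome and the `selenium` package are installed. It waits for an element that you expect to exist only on the final page, so pick the selector after inspecting the site in a normal browser.

```python
def render_page(url, wait_for_css, timeout=10):
    """Load `url` in headless Chrome and return the HTML after the JS has run.

    `wait_for_css` is a CSS selector expected to appear only once the page
    (including any JS redirect) has finished loading. Hypothetical helper.
    """
    # Third-party imports are kept inside the function so that merely
    # defining it does not require Selenium to be installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the expected element is present, or raise on timeout.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_for_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned string can then be passed to BeautifulSoup exactly as in the question, e.g. bs(render_page(url, "form"), 'html.parser'), where "form" is only a placeholder selector.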

Hack

In some cases you can use plain code (Python, PHP) to imitate the requests the JS makes (usually XHR/Ajax) and fetch the missing data directly. E.g. Scrape a JS Lazy load page by Python requests
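For the page in the question, the script does nothing but redirect the browser to another URL depending on the hostname, so a browser may not be needed at all. The sketch below imitates that logic in plain Python: it reads the `indexOf(...)` / `window.location.replace(...)` pairs out of the script and requests the target directly. The names `find_redirect_target` and `fetch_redirected_html` are made up for this example, the regex only fits scripts written in this exact style, and it assumes that when several rules match, the last `replace()` call is the one that takes effect.

```python
import re
from urllib.parse import urljoin

# Matches the pattern used in the page's script:
#   if (strUrl.indexOf("needle") > 0) window.location.replace("target");
RULE = re.compile(
    r'indexOf\("([^"]+)"\)\s*>\s*0\)\s*window\.location\.replace\("([^"]+)"\)'
)


def find_redirect_target(page_url, script_text):
    """Return the URL the script would redirect `page_url` to, or None."""
    target = None
    for needle, destination in RULE.findall(script_text):
        # Mirror the JavaScript test: the substring must occur after index 0.
        if page_url.find(needle) > 0:
            # Assumption: when several rules match, the last replace() wins.
            # urljoin resolves relative targets such as "Account/Logon".
            target = urljoin(page_url, destination)
    return target


def fetch_redirected_html(page_url):
    """Download `page_url`, follow its scripted redirect once, return HTML."""
    import requests  # third-party; imported here so the parser is stdlib-only

    first = requests.get(page_url, timeout=10)
    target = find_redirect_target(page_url, first.text)
    if target is None:
        return first.text  # no scripted redirect found; use the page as-is
    return requests.get(target, timeout=10).text
```

If the target page itself builds its content with JS, this shortcut is not enough, and you are back to browser automation or to imitating that page's XHR calls.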
