
How to parse a JavaScript-rendered webpage with Python (BS4?)

I am currently using BS4 to parse a webpage. The block I need comes back as a string of JavaScript, so BS4 cannot recognize the URLs I am trying to extract.

I believe the part I am attempting to extract in BS4 is here:

              var vd1="\x3c\x73\x6f\x75\x72\x63\x65\x20\x73\x72\x63\x3d\x27";
              var vd2="\x27\x20\x74\x79\x70\x65\x3d\x27\x76\x69\x64\x65\x6f\x2f\x6d\x70\x34\x27\x3e";

              var luu=pkl("uggc://navzrurnira.rh/tvs.cuc?vcqrgrpgrq");


                    var soienfu="\x61\x78\x66\x52\x33\x64\x54\x46\x33\x72\x6b\x36\x36\x2f\x67\x4f\x65\x46\x4d\x6b\x49\x2f\x67\x6a\x77\x42\x67\x45\x66\x61\x62\x78\x33\x58\x66\x6c\x43\x44\x78\x70\x6c\x2b\x7a\x76\x38\x6b\x6c\x6a\x6b\x53\x41\x63\x4f\x34\x4a\x77\x47\x47\x78\x35\x2f\x7c\x71\x6e\x47\x49\x55\x4d\x6a\x34\x54\x48\x69\x34\x41\x45\x75\x4c\x78\x35\x58\x46\x4a\x37\x62\x65\x42\x52\x61\x37\x36\x53\x34\x6e\x46\x6f\x75\x49\x47\x55\x42\x7a\x57\x4e\x44\x61\x4a\x6a\x45\x55\x59\x56\x47\x54\x30\x3d"; soienfu=soienfu.replace(/\|/g,"1"); soienfu=vkl(soienfu); soienfu=dfgsr(soienfu);


                    var iusfdb="\x61\x78\x66\x52\x33\x64\x54\x46\x33\x72\x6b\x35\x36\x2b\x4d\x4f\x64\x46\x6b\x6b\x49\x2f\x67\x6a\x77\x42\x67\x45\x66\x61\x62\x78\x33\x58\x66\x6c\x43\x44\x78\x70\x6c\x2b\x7a\x76\x38\x6b\x6c\x6a\x6b\x53\x41\x63\x4f\x34\x4a\x77\x47\x47\x78\x35\x2f\x7c\x71\x6e\x47\x49\x55\x4d\x6a\x34\x54\x48\x69\x34\x41\x45\x75\x4c\x78\x35\x58\x46\x4a\x37\x62\x65\x42\x52\x61\x37\x36\x53\x34\x6e\x46\x6f\x75\x49\x47\x55\x42\x7a\x57\x4e\x44\x61\x4a\x6a\x45\x55\x59\x56\x47\x54\x30\x3d"; iusfdb=iusfdb.replace(/\|/g,"1"); iusfdb=vkl(iusfdb); iusfdb=dfgsr(iusfdb);


                    var ufbjhse="\x61\x78\x66\x52\x33\x64\x54\x46\x33\x72\x6b\x34\x36\x66\x45\x65\x50\x7c\x6c\x6b\x4b\x2f\x73\x76\x78\x52\x67\x4e\x62\x71\x4c\x70\x6c\x6e\x79\x2b\x51\x6e\x35\x30\x6b\x4c\x57\x7a\x74\x6b\x67\x2f\x6b\x48\x77\x32\x64\x61\x7c\x6a\x43\x45\x46\x55\x77\x57\x65\x71\x58\x4d\x51\x51\x6c\x59\x44\x64\x6b\x35\x77\x64\x76\x62\x46\x38\x56\x46\x70\x37\x59\x4f\x6b\x36\x44\x49\x54\x2f\x34\x58\x4d\x2b\x37\x70\x2b\x66\x57\x57\x7a\x43\x54\x75\x6f\x68\x45\x55\x52\x58"; ufbjhse=ufbjhse.replace(/\|/g,"1"); ufbjhse=vkl(ufbjhse); ufbjhse=dfgsr(ufbjhse);


              document.write("<video "+" class='vid'  id='videodiv' width='100%' autoplay='autoplay' preload='none'>"+ vd1 +soienfu+ vd2 + vd1+iusfdb+ vd2 + vd1+ufbjhse+ vd2 +"Your browser does not support the video tag.</video> ");
              }

However, when I look for this in the page's HTML, all I get is this:

Your browser does not support the video tag.

Ideally, I'd like to pull the video address out of the HTML block, for example:

http://s4tyh.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w75

The code I'm using to get there looks like this.

import re

import bs4
import requests

url = "http://animeheaven.eu/watch.php?a=Fairy%20Tail&e=55"
mainsite = "http://animeheaven.eu/"
r2 = requests.get(url)
r2.raise_for_status()
soup2 = bs4.BeautifulSoup(r2.text, "html.parser")
# Every <script> element on the page, as a list of Tag objects
dlink = soup2.select("script")

In theory, what I want to do next is parse dlink for the URL, but the JS seems to be getting in the way. I am not very familiar with JS and I am new to web scraping, so this is where I get stuck.

# would extract standard url
mylink=re.compile(r"href='(.*)'")
downlink=mylink.search(str(dlink[3]))[1]

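This page is rendered by JavaScript, and the script content you have there is obfuscated rather than plain markup: the string literals are hex-escaped and the video addresses are encoded values that the page only decodes at run time. That makes it unreadable to humans, and Beautiful Soup, which does not execute JavaScript, sees it only as text.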
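To see what the obfuscation hides, here is a minimal sketch that decodes two of the lines from the question without a browser. The \xNN escapes are ordinary JavaScript string escapes, so that part is mechanical. Treating pkl() as ROT13 is an assumption on my part (its body is not shown), although the decoded result does look like a URL on the same site. The functions vkl() and dfgsr() that produce the actual video addresses are not shown either, so this sketch does not attempt them. The helper name decode_hex_escapes is made up for the example.

```python
import codecs
import re

# Two lines copied from the question's script block (raw string keeps the backslashes).
js = r'''
var vd1="\x3c\x73\x6f\x75\x72\x63\x65\x20\x73\x72\x63\x3d\x27";
var luu=pkl("uggc://navzrurnira.rh/tvs.cuc?vcqrgrpgrq");
'''


def decode_hex_escapes(text):
    """Replace each \\xNN escape with the character it stands for."""
    return re.sub(r"\\x([0-9a-fA-F]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  text)


# Pull the string literal out of the vd1 assignment and decode it.
vd1_raw = re.search(r'var vd1="([^"]*)"', js).group(1)
vd1 = decode_hex_escapes(vd1_raw)

# Assumption: pkl() behaves like ROT13; the site's own function is not shown.
luu_raw = re.search(r'pkl\("([^"]*)"\)', js).group(1)
luu = codecs.decode(luu_raw, "rot13")

print(vd1)  # <source src='
print(luu)  # http://animeheaven.eu/gif.php?ipdetected
```

So vd1 and vd2 are just the opening and closing pieces of a source tag, and the part in between is what the page computes at run time. That is why executing the script in a browser is the more practical route here.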

Selenium drives a real browser, which executes the JavaScript that renders the page; we can then read the resulting content from the browser.

I'll walk through the steps of getting and using Selenium:

1. Get Selenium with pip install selenium

2. Install a driver (I used the Chrome driver)

3. Look at the video element with 'inspect element' (right-click the video in your browser of choice) and find something that identifies it. In this case the video has an id with the value videodiv. If we inspect its HTML, it looks like this:

<video id="videodiv" width="100%" height="100%" style="display: block; cursor: none;" autoplay="autoplay" preload="none">
  <source src="http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s4tyh.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s3sd.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">Your browser does not support the video tag.</video>


4. Using the id and tag we just found, we can now write some Python to retrieve the src:

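from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Raw string, so the backslashes in the Windows path are not read as escapes.
# Selenium 4 removed executable_path and the find_element_by_* helpers,
# so the driver path goes through Service and lookups go through By.
service = Service(r"C:\Users\yourname\Desktop\chromedriver.exe")
browser = webdriver.Chrome(service=service)
url = "http://animeheaven.eu/watch.php?a=Fairy%20Tail&e=55"
browser.get(url)
viddiv = browser.find_element(By.ID, "videodiv")
source = viddiv.find_element(By.TAG_NAME, "source")
source.get_attribute("src")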

Output:

'http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130'
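The rendered markup in step 3 has three source children, and the lookup above returns only the first. If you want all of them, one option is to take the rendered HTML from the browser (Selenium exposes it as browser.page_source) and parse it afterwards. The sketch below uses only the standard library and a copy of the markup from step 3 as a stand-in for page_source; the class name SourceCollector is made up for the example, and BS4 would do the same job if you prefer it.

```python
from html.parser import HTMLParser


class SourceCollector(HTMLParser):
    """Collect the src of every <source> inside <video id="videodiv">."""

    def __init__(self):
        super().__init__()
        self.in_video = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "video" and attrs.get("id") == "videodiv":
            self.in_video = True
        elif tag == "source" and self.in_video and "src" in attrs:
            self.urls.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "video":
            self.in_video = False


# Stand-in for browser.page_source: the rendered markup shown in step 3.
rendered_html = """
<video id="videodiv" width="100%" height="100%" autoplay="autoplay" preload="none">
  <source src="http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s4tyh.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s3sd.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">Your browser does not support the video tag.</video>
"""

collector = SourceCollector()
collector.feed(rendered_html)
for video_url in collector.urls:
    print(video_url)
```

The three addresses appear to be mirrors of the same file on different hosts, so having the full list gives you a fallback if the first host is unavailable.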


 