How to parse a Javascript rendered webpage with Python (BS4?)

All - I am currently using BS4 to parse a webpage. BS4 returns the block below, which is written in JS, as a plain string, so it cannot recognize the URLs I am trying to extract.

I believe the part I am attempting to extract in BS4 is here:

              var vd1="\x3c\x73\x6f\x75\x72\x63\x65\x20\x73\x72\x63\x3d\x27";
              var vd2="\x27\x20\x74\x79\x70\x65\x3d\x27\x76\x69\x64\x65\x6f\x2f\x6d\x70\x34\x27\x3e";

              var luu=pkl("uggc://navzrurnira.rh/tvs.cuc?vcqrgrpgrq");


                    var soienfu="\x61\x78\x66\x52\x33\x64\x54\x46\x33\x72\x6b\x36\x36\x2f\x67\x4f\x65\x46\x4d\x6b\x49\x2f\x67\x6a\x77\x42\x67\x45\x66\x61\x62\x78\x33\x58\x66\x6c\x43\x44\x78\x70\x6c\x2b\x7a\x76\x38\x6b\x6c\x6a\x6b\x53\x41\x63\x4f\x34\x4a\x77\x47\x47\x78\x35\x2f\x7c\x71\x6e\x47\x49\x55\x4d\x6a\x34\x54\x48\x69\x34\x41\x45\x75\x4c\x78\x35\x58\x46\x4a\x37\x62\x65\x42\x52\x61\x37\x36\x53\x34\x6e\x46\x6f\x75\x49\x47\x55\x42\x7a\x57\x4e\x44\x61\x4a\x6a\x45\x55\x59\x56\x47\x54\x30\x3d"; soienfu=soienfu.replace(/\|/g,"1"); soienfu=vkl(soienfu); soienfu=dfgsr(soienfu);


                    var iusfdb="\x61\x78\x66\x52\x33\x64\x54\x46\x33\x72\x6b\x35\x36\x2b\x4d\x4f\x64\x46\x6b\x6b\x49\x2f\x67\x6a\x77\x42\x67\x45\x66\x61\x62\x78\x33\x58\x66\x6c\x43\x44\x78\x70\x6c\x2b\x7a\x76\x38\x6b\x6c\x6a\x6b\x53\x41\x63\x4f\x34\x4a\x77\x47\x47\x78\x35\x2f\x7c\x71\x6e\x47\x49\x55\x4d\x6a\x34\x54\x48\x69\x34\x41\x45\x75\x4c\x78\x35\x58\x46\x4a\x37\x62\x65\x42\x52\x61\x37\x36\x53\x34\x6e\x46\x6f\x75\x49\x47\x55\x42\x7a\x57\x4e\x44\x61\x4a\x6a\x45\x55\x59\x56\x47\x54\x30\x3d"; iusfdb=iusfdb.replace(/\|/g,"1"); iusfdb=vkl(iusfdb); iusfdb=dfgsr(iusfdb);


                    var ufbjhse="\x61\x78\x66\x52\x33\x64\x54\x46\x33\x72\x6b\x34\x36\x66\x45\x65\x50\x7c\x6c\x6b\x4b\x2f\x73\x76\x78\x52\x67\x4e\x62\x71\x4c\x70\x6c\x6e\x79\x2b\x51\x6e\x35\x30\x6b\x4c\x57\x7a\x74\x6b\x67\x2f\x6b\x48\x77\x32\x64\x61\x7c\x6a\x43\x45\x46\x55\x77\x57\x65\x71\x58\x4d\x51\x51\x6c\x59\x44\x64\x6b\x35\x77\x64\x76\x62\x46\x38\x56\x46\x70\x37\x59\x4f\x6b\x36\x44\x49\x54\x2f\x34\x58\x4d\x2b\x37\x70\x2b\x66\x57\x57\x7a\x43\x54\x75\x6f\x68\x45\x55\x52\x58"; ufbjhse=ufbjhse.replace(/\|/g,"1"); ufbjhse=vkl(ufbjhse); ufbjhse=dfgsr(ufbjhse);


              document.write("<video "+" class='vid'  id='videodiv' width='100%' autoplay='autoplay' preload='none'>"+ vd1 +soienfu+ vd2 + vd1+iusfdb+ vd2 + vd1+ufbjhse+ vd2 +"Your browser does not support the video tag.</video> ");
              }

However, when I inspect the HTML on the website, I see this:

Your browser does not support the video tag.

Ideally, I'd like to pull the video address out of the HTML block:

http://s4tyh.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w75

The code I'm using to get there looks like this:

import requests,bs4,re,sys,os
url="http://animeheaven.eu/watch.php?a=Fairy%20Tail&e=55"
mainsite="http://animeheaven.eu/"
r2=requests.get(url)
r2.raise_for_status()
soup2=bs4.BeautifulSoup(r2.text,"html.parser")
dlink=soup2.select("script")

In theory, what I want to do here is parse dlink for the URL; however, the JS seems to be causing issues. I am not very familiar with JS and am new to web scraping, so this is where I get stuck.

# would extract standard url
mylink=re.compile(r"href='(.*)'")
downlink=mylink.search(str(dlink[3]))[1]

This webpage is rendered with JavaScript, and the content in that script is what is called 'minified content' (here it is really obfuscated: hex-escaped and encoded strings), which is unreadable to humans (and to Beautiful Soup).
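As a small illustration of why Beautiful Soup only sees opaque strings, the hex-escaped literals from the question can be decoded with the standard library alone. This is a sketch, not the site's own logic: `decode_js_hex_escapes` is a hypothetical helper, and treating `pkl` as ROT13 is an assumption based only on what its argument decodes to. The actual video URLs also pass through `vkl` and `dfgsr`, whose definitions are not shown in the question, so they cannot be recovered this way.

```python
import codecs
import re


def decode_js_hex_escapes(literal: str) -> str:
    """Replace each JS-style \\xNN escape with the character it encodes."""
    return re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda m: chr(int(m.group(1), 16)),
        literal,
    )


# Raw strings keep the backslashes exactly as they appear in the page source.
vd1 = decode_js_hex_escapes(r"\x3c\x73\x6f\x75\x72\x63\x65\x20\x73\x72\x63\x3d\x27")
vd2 = decode_js_hex_escapes(
    r"\x27\x20\x74\x79\x70\x65\x3d\x27\x76\x69\x64\x65\x6f\x2f\x6d\x70\x34\x27\x3e"
)
print(vd1)  # <source src='
print(vd2)  # ' type='video/mp4'>

# Assumption: pkl() is ROT13; its argument decodes to a readable URL that way.
luu = codecs.decode("uggc://navzrurnira.rh/tvs.cuc?vcqrgrpgrq", "rot13")
print(luu)  # http://animeheaven.eu/gif.php?ipdetected
```

So the page builds its `<source>` tags at run time from these pieces, which is why a static parser never sees them as HTML.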

Selenium is a way of executing the JavaScript used to render the site, after which we can process the resulting content.

I'll walk through the steps of getting and using Selenium:

1. Get Selenium with pip install selenium

2. Install a driver (I used the Chrome driver)

3. Look at the video element with 'inspect element' (right click and you'll see it) in your browser of choice, and look for something that identifies the video src. In this case the video has an id with the value videodiv. If we inspect this HTML it looks like:

<video id="videodiv" width="100%" height="100%" style="display: block; cursor: none;" autoplay="autoplay" preload="none">
  <source src="http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s4tyh.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s3sd.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">Your browser does not support the video tag.</video>


4. Using the id and tag we just found above, we can now write some Python to retrieve it:

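from selenium import webdriver
# Raw string: in a normal string, "\U" in the Windows path is a syntax error in Python 3
browser = webdriver.Chrome(executable_path=r"C:\Users\yourname\Desktop\chromedriver.exe")
url="http://animeheaven.eu/watch.php?a=Fairy%20Tail&e=55"
browser.get(url)  # loads the page and runs its JavaScript
# Note: recent Selenium 4 releases removed executable_path and the find_element_by_*
# helpers; there, use Service(...) and find_element(By.ID, ...) / find_element(By.TAG_NAME, ...)
viddiv = browser.find_element_by_id('videodiv')
source = viddiv.find_element_by_tag_name('source')  # first <source> inside the video
source.get_attribute('src')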

Output:

'http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130'
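The call above returns only the first `<source>`. If all three mirror URLs are wanted, one option is to take the rendered markup (for example Selenium's `browser.page_source`) and collect every `src` inside the `videodiv` element. The sketch below uses only the standard library and the rendered HTML shown in step 3 as sample input; `SourceCollector` and `rendered_html` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class SourceCollector(HTMLParser):
    """Collect src attributes of <source> tags inside <video id="videodiv">."""

    def __init__(self):
        super().__init__()
        self.in_video = False
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "video" and attrs.get("id") == "videodiv":
            self.in_video = True
        elif tag == "source" and self.in_video and "src" in attrs:
            self.sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "video":
            self.in_video = False


# Sample input: the rendered markup from step 3 (in practice, browser.page_source).
rendered_html = """
<video id="videodiv" width="100%" height="100%" autoplay="autoplay" preload="none">
  <source src="http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s4tyh.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s3sd.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">Your browser does not support the video tag.</video>
"""

collector = SourceCollector()
collector.feed(rendered_html)
urls = collector.sources
for u in urls:
    print(u)
```

The same rendered markup could equally be handed to Beautiful Soup, as in the question; the standard-library parser is used here only to keep the sketch dependency-free.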
