I'm new to JavaScript and am trying to extract data from it with Python. I've been using BeautifulSoup along with Requests to pull the 'file' line out of the 'RT.currentVideo' section of this script, but I can't get it to work. I'm completely lost as to how to even isolate this section of the webpage, since the script tag doesn't have an identifier like those in most of the related questions I've found online.
Any help would really be appreciated, thanks for taking the time to check in!
This is what I've been using to read the page:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
url = "http://roosterteeth.com/episode/rt-docs-connected-connected-official-trailer"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'utf-8'})
response = urlopen(req)
webpage = BeautifulSoup(response.read().decode('utf-8', 'ignore'), "html.parser")
And this is the JavaScript block on the page I want to extract info from. Again, what I'm looking to get is the string assigned to the 'file' key.
<script>
RT.currentVideo = {
authUser: 0,
autoPlay: 1,
csrfToken: 'H240Yw8x9oYasUw2Tzt3qpwzA14Z1ajRjuXo6RV1',
endPoint: 89,
desktopAgent: 1,
file: 'https://rtv2-video.roosterteeth.com/uploads/videos/0e840b4f-a188-440d-adc0-b78093c1009f/index.m3u8',
You can use a regular expression to extract that value from the page HTML.
import re
# Capture everything between the single quotes that follow "file:"
regex = r"file:\s*'([^']+)'"
matches = re.findall(regex, webpageHtmlString)
print(matches[0])
webpageHtmlString should be the HTML of the page as a string, for example the decoded result of response.read() from the code in the question.
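If other scripts on the page might also contain a 'file:' key, the pattern can be tied to the RT.currentVideo assignment. The following is only a sketch: the helper name extract_file_url and the sample string are made up for illustration, and it assumes the value is single-quoted as in the excerpt from the question.

```python
import re

# Assumption: the value is single-quoted and appears inside the
# RT.currentVideo object literal, as in the excerpt from the question.
FILE_PATTERN = re.compile(
    r"RT\.currentVideo\s*=\s*\{.*?\bfile:\s*'([^']+)'",
    re.DOTALL,  # let .*? span the lines between the opening brace and "file:"
)


def extract_file_url(html):
    """Return the 'file' value from the RT.currentVideo block, or None."""
    match = FILE_PATTERN.search(html)
    return match.group(1) if match else None


# Made-up sample standing in for the real page source.
sample = """<script>
RT.currentVideo = {
authUser: 0,
file: 'https://example.com/video/index.m3u8',
"""
print(extract_file_url(sample))
```

Returning None instead of indexing into a list avoids an IndexError when the page layout changes and nothing matches.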
Use PyQuery to run jQuery-like queries on HTML content from Python.
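from pyquery import PyQuery as pq
doc = pq(webpageHtmlString)  # parse the page HTML (a string)
script_tags = doc('script')  # select every <script> element
print(script_tags[0].text)  # inline JavaScript of the first script tag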
Depending on your content, you can narrow the selection further with jQuery-like queries.
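The same idea (select the script tags, then pick the one you need) can be sketched without third-party packages. Note that this swaps PyQuery for the standard library's html.parser; the names ScriptCollector and find_video_file and the sample page are hypothetical, and the single-quoted 'file' value is assumed from the excerpt in the question.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text of every <script> element in a page."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")  # start a new buffer for this tag

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data  # data may arrive in several chunks


def find_video_file(html):
    """Return the 'file' value from the script defining RT.currentVideo."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for script in collector.scripts:
        if "RT.currentVideo" in script:
            match = re.search(r"\bfile:\s*'([^']+)'", script)
            if match:
                return match.group(1)
    return None


# Made-up page standing in for the real one.
page = """<html><head><script src="a.js"></script></head><body>
<script>
RT.currentVideo = {
autoPlay: 1,
file: 'https://example.com/video/index.m3u8',
};
</script></body></html>"""
print(find_video_file(page))
```

Filtering on the script's contents works around the missing identifier mentioned in the question, since the tag is picked by what it contains rather than by an id or class.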