Scraping an HTTP link using Python BeautifulSoup

I am trying to scrape an HTTP link from a site using a regex, and so far I have tried this:

import re
import urllib.request

from bs4 import BeautifulSoup as soup

user_agent1 = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://somesite.com"
headers = {'User-Agent': user_agent1}

# Fetch the page with a browser-like User-Agent header
request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
page_html = response.read()

# Parse the HTML and take the text of the sixth <script> tag
page_soup = soup(page_html, "html.parser")
data1 = page_soup.find_all("script")[5].string

With this code, print(data1) gives the following output:

                var playerInstance = jwplayer("container");
            playerInstance.setup({
                width: "100%",
                height: "100%",
                controls: true,
                flashplayer: "http://p.jwpcdn.com/player/v/7.3.6/jwplayer.flash.swf",
                aspectratio: "16:9",
                fullscreen: "true",
                primary: 'html5',
                displaytitle: true,
                "preload": "auto",
                autostart: false,
                sources: [{"file":"https://archive.org/v.mp4","label":"1080p","type":"video/mp4"}]
            });

From this output in the variable data1, I am trying to scrape the HTTP link:

https://archive.org/v.mp4

So I added a few more lines of code:

p = re.compile('file :(.*?);')
m = p.match(data1)
print(m)

But when I print(m), the output is None. How can I get the HTTP link into a variable?

I believe I am making a mistake in the regular expression.

This might help:

import re
a = """var playerInstance = jwplayer("container");
            playerInstance.setup({
                width: "100%",
                height: "100%",
                controls: true,
                flashplayer: "http://p.jwpcdn.com/player/v/7.3.6/jwplayer.flash.swf",
                aspectratio: "16:9",
                fullscreen: "true",
                primary: 'html5',
                displaytitle: true,
                "preload": "auto",
                autostart: false,
                sources: [{"file":"https://archive.org/v.mp4","label":"1080p","type":"video/mp4"}]
            });"""

# Keep only the text after "sources:", then grab the first http(s) URL in it
src = a.split("sources:")[1]
print(re.search(r'(?P<url>https?://[^\s]([^"]*))', src).group("url"))

Output:

https://archive.org/v.mp4
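
As for why the attempt in the question prints None: re.match only matches at the very start of the string, and data1 starts with "var playerInstance", not with "file". The pattern 'file :(.*?);' also expects a space before the colon and no quotes, while the script contains "file":"...". A minimal sketch of two alternatives follows. The sample string is an abridged copy of the output in the question with each string literal kept on one line, and the names video_url, sources and urls are only illustrative. The second approach assumes the sources array happens to be valid JSON, which is true for this sample but may not hold for every page.

```python
import json
import re

# Abridged sample of the script text from the question (an assumption about
# what data1 looks like on the real page).
data1 = """var playerInstance = jwplayer("container");
            playerInstance.setup({
                width: "100%",
                autostart: false,
                sources: [{"file":"https://archive.org/v.mp4","label":"1080p","type":"video/mp4"}]
            });"""

# Approach 1: re.search scans the whole string, and the pattern mirrors the
# literal text "file":"..." that actually appears in the script.
match = re.search(r'"file"\s*:\s*"([^"]+)"', data1)
video_url = match.group(1) if match else None
print(video_url)  # https://archive.org/v.mp4

# Approach 2: capture the bracketed sources array and parse it as JSON.
# This only works when the array is valid JSON, as it is in this sample.
sources_match = re.search(r'sources\s*:\s*(\[.*?\])', data1, re.DOTALL)
sources = json.loads(sources_match.group(1)) if sources_match else []
urls = [entry["file"] for entry in sources]
print(urls)  # ['https://archive.org/v.mp4']
```

The first approach is enough when only one link is needed. The second may be handier if the player lists several qualities, since each entry keeps its "label" and "type" next to the URL.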
