Scraping an HTTP link with Python BeautifulSoup

Question

I am trying to scrape an HTTP link from a website using a regular expression. This is what I have tried so far:

import re
import urllib.request

from bs4 import BeautifulSoup as soup

# Send a browser-like User-Agent header with the request.
user_agent1 = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://somesite.com"
headers = {'User-Agent': user_agent1}
request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
page_html = response.read()

# Parse the page and take the text of the sixth <script> tag.
page_soup = soup(page_html, "html.parser")
data1 = page_soup.find_all("script")[5].string

With this code, print(data1) gives the following output:

                var playerInstance = jwplayer("container");
            playerInstance.setup({
                width: "100%",
                height: "100%",
                controls: true,
                flashplayer: "http://p.jwpcdn.com/player/v/7.3.6/jwplayer.flash.swf",
                aspectratio: "16:9",
                fullscreen: "true",
                primary: 'html5',
                displaytitle: true,
                "preload": "auto",
                autostart: false,
                sources: [{"file":"https://archive.org/v.mp4","label":"1080p","type":"video/mp4"}]
            });

From this output, stored in the variable data1, I am trying to extract the HTTP link:

https://archive.org/v.mp4

So I added a few more lines of code:

p = re.compile('file :(.*?);')
m = p.match(data1)
print(m)

But when I run print(m), the output is None. How can I capture the HTTP link in a variable?

I believe I made a mistake in the regular expression.

Answer 1

This may help:

import re
a = """var playerInstance = jwplayer("container");
            playerInstance.setup({
                width: "100%",
                height: "100%",
                controls: true,
                flashplayer: "http://p.jwpcdn.com/player/v/7.3.6/jwplayer.flash.swf",
                aspectratio: "16:9",
                fullscreen: "true",
                primary: 'html5',
                displaytitle: true,
                "preload": "auto",
                autostart: false,
                sources: [{"file":"https://archive.org/v.mp4","label":"1080p","type":"video/mp4"}]
            });"""

# Keep only the text after "sources:", then grab the first URL up to the closing quote.
src = a.split("sources:")[1]
print(re.search(r'(?P<url>https?://[^\s]([^"]*))', src).group("url"))

Output:

https://archive.org/v.mp4
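
As for why the original attempt printed None: re.match only tries to match at the very start of the string, and the pattern 'file :(.*?);' never occurs in the script text anyway, since the text contains "file": with quotes and no space. Below is a minimal sketch of two alternatives. It uses a shortened stand-in for data1, and the variable names video_url, sources and files are only illustrative. The second approach assumes the sources array happens to be strict JSON, which is true for this sample but may not hold on other pages.

```python
import json
import re

# Shortened stand-in for data1 from the question (assumed sample text).
data1 = '''
    var playerInstance = jwplayer("container");
    playerInstance.setup({
        autostart: false,
        sources: [{"file":"https://archive.org/v.mp4","label":"1080p","type":"video/mp4"}]
    });
'''

# The question's attempt: match() anchors at position 0, and 'file :' is
# not present in the text, so the result is None.
original = re.compile('file :(.*?);').match(data1)

# Option 1: search() scans the whole string; the pattern mirrors the
# actual text ("file":"...") and captures what is between the quotes.
m = re.search(r'"file"\s*:\s*"(.*?)"', data1)
video_url = m.group(1) if m else None
print(video_url)

# Option 2: cut out the sources array and parse it as JSON.
# This only works when the array is strict JSON (double-quoted keys).
arr = re.search(r'sources:\s*(\[.*?\])', data1, re.DOTALL)
sources = json.loads(arr.group(1)) if arr else []
files = [s["file"] for s in sources]
print(files)
```

In the question's script, data1 would come from page_soup.find_all("script")[5].string instead of the literal above. The JSON route is handy when the page lists several sources and the labels are needed too, while the plain regex is more forgiving if the array is not valid JSON.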

Solution 1
0 Accepted 2018-02-19 13:33:28