
Scraping an HTTP link using Python BeautifulSoup

I am trying to scrape an http link from a website using a regular expression. This is what I have tried so far:

import re
import urllib.request

from bs4 import BeautifulSoup as soup

user_agent1 = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://somesite.com"
headers = {'User-Agent': user_agent1}

# Fetch the page with a browser-like User-Agent header.
request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
page_html = response.read()

# Parse the HTML and take the text of the sixth <script> tag.
page_soup = soup(page_html, "html.parser")
data1 = page_soup.find_all("script")[5].string

With this code, print(data1) gives the following output:

                var playerInstance = jwplayer("container");
            playerInstance.setup({
                width: "100%",
                height: "100%",
                controls: true,
                flashplayer: "http://p.jwpcdn.com/player/v/7.3.6/jwplayer.flash.swf",
                aspectratio: "16:9",
                fullscreen: "true",
                primary: 'html5',
                displaytitle: true,
                "preload": "auto",
                autostart: false,
                sources: [{"file":"https://archive.org/v.mp4","label":"1080p","type":"video/mp4"}]
            });

From the contents of data1, I am trying to extract this http link:

https://archive.org/v.mp4

So I added a few more lines of code:

p = re.compile('file :(.*?);')
m = p.match(data1)
print(m)

But print(m) outputs None. How can I get the http link into a variable?

I believe I made a mistake in the regular expression.
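Two things in the snippet above are likely causing the None. First, re.match only matches at the very start of the string, while the link sits near the end of the script. Second, the pattern 'file :(.*?);' expects the literal text `file :`, but the script contains `"file":"https://..."`. A minimal sketch of the difference, assuming data1 has the shape shown above (the `sample` string is a shortened stand-in, not the real page):

```python
import re

# Shortened stand-in for data1 (assumption: same shape as the script text above).
sample = '''playerInstance.setup({
    autostart: false,
    sources: [{"file":"https://archive.org/v.mp4","label":"1080p","type":"video/mp4"}]
});'''

# The original attempt: match() anchors at position 0, and the text
# never contains the literal 'file :', so nothing is found.
original = re.compile('file :(.*?);').match(sample)

# search() scans the whole string; this pattern mirrors the actual
# '"file":"..."' text and captures everything up to the closing quote.
found = re.search(r'"file"\s*:\s*"([^"]+)"', sample)
link = found.group(1) if found else None

print(original)
print(link)
```

Here `link` ends up holding the URL as a plain string, so it can be stored in a variable or passed on to a download step.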

This may help:

import re
a = """var playerInstance = jwplayer("container");
            playerInstance.setup({
                width: "100%",
                height: "100%",
                controls: true,
                flashplayer: "http://p.jwpcdn.com/player/v/7.3.6/jwplayer.flash.swf",
                aspectratio: "16:9",
                fullscreen: "true",
                primary: 'html5',
                displaytitle: true,
                "preload": "auto",
                autostart: false,
                sources: [{"file":"https://archive.org/v.mp4","label":"1080p","type":"video/mp4"}]
            });"""

src = a.split("sources:")[1]
# Take the first http(s) URL after "sources:", stopping at whitespace or a quote.
print(re.search(r'(?P<url>https?://[^\s"]+)', src).group("url"))

Output:

https://archive.org/v.mp4
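
As an alternative to pulling the URL out with a regex alone, the `sources` value in this particular script happens to be valid JSON, so it can be parsed and read by key. The sketch below is only an illustration under that assumption: `extract_files` is a hypothetical helper name, the sample string is shortened, and the non-greedy `\[.*?\]` assumes the array contains no nested square brackets.

```python
import json
import re


def extract_files(script_text):
    """Return the "file" URLs from a jwplayer-style `sources: [...]` array.

    Hypothetical helper: assumes the array is valid JSON and has no nested
    square brackets.
    """
    m = re.search(r'sources:\s*(\[.*?\])', script_text, re.DOTALL)
    if m is None:
        return []
    sources = json.loads(m.group(1))
    return [s["file"] for s in sources if "file" in s]


# Shortened stand-in for the script text (assumption, not the real page).
a = '''playerInstance.setup({
    primary: 'html5',
    sources: [{"file":"https://archive.org/v.mp4","label":"1080p","type":"video/mp4"}]
});'''

links = extract_files(a)
print(links)
```

This also returns every entry when the player lists several qualities, and the other keys such as "label" remain available on each parsed dictionary.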


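Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.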

 