Extracting URLs from a string of HTML data

Question

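I have already tried extracting this HTML data with BeautifulSoup, but it can only select by tag. What I need is the trailing something.html or some/something.html that follows the prefix www.example.com/products/, with parameters such as ?search=1 removed. I would prefer to use a regular expression, but I don't know the exact pattern.

Input: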

System","urlKey":"ppath","value":[],"hidden":false,"locked":false}],"bizData":"Related+Categories=Mobiles","pos":0},"listItems":[{"name":"Sam-Sung B309i High Precision Smooth Keypad Mobile Phone ","nid":"250505808","icons":[],"productUrl":"//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1", "image": ["//www.example.com/products/site/ammaxxllx.html], "https://www.example.com/site/kakzja.html

prefix = "www.example.com/products/"
# do something
# expected output: ['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']

Answer 1

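I think you want to use re here, with a small trick: in these URIs a "?" follows the "html", so you can match up to "html" and discard the rest: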

import re 

L = ["//www.example.com/products/ammaxxllx.html", "https://www.example.com/site/kakzja.html", "//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1"]
prefix = "www.example.com/products/"

>>> [re.search(re.escape(prefix) + r'(.*?\.html)', el).group(1) for el in L if prefix in el]
['ammaxxllx.html', 'sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html']
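
The question starts from one raw string rather than a list, so the same idea can be sketched with re.findall applied to the text directly. This is only a sketch: the raw value below is a shortened, hypothetical stand-in for the input in the question, and the character class assumes the URLs sit inside double quotes and contain no whitespace.

```python
import re

# Shortened, hypothetical stand-in for the raw string in the question.
raw = (
    '"productUrl":"//www.example.com/products/sam-sung-b309i-i250505808.html?search=1", '
    '"image": ["//www.example.com/products/site/ammaxxllx.html"], '
    '"https://www.example.com/site/kakzja.html"'
)

prefix = "www.example.com/products/"

# re.escape keeps the dots in the prefix literal. The class [^"?\s] stops the
# lazy match from crossing a quote, a query string, or whitespace before it
# reaches ".html".
pattern = re.compile(re.escape(prefix) + r'([^"?\s]*?\.html)')

# With a single capture group, findall returns the captured tails only.
tails = pattern.findall(raw)
print(tails)
# ['sam-sung-b309i-i250505808.html', 'site/ammaxxllx.html']
```

The third URL is skipped because it does not contain the /products/ prefix.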

Answer 2

The answers above that use the re module are all great, but you can also work around the problem without that module, like this:

prefix = 'www.example.com/products/'
L = ['//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1', '//www.example.com/products/site/ammaxxllx.html', 'https://www.example.com/site/kakzja.html']
ans = []
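for url in L:
    parts = url.rsplit(prefix, 1)
    try:
        tail = parts[1]  # IndexError if the prefix is absent
        # Keep everything up to and including the first '.html';
        # index() raises ValueError if '.html' is missing
        ans.append(tail[:tail.index('.html')] + '.html')
    except (IndexError, ValueError):
        pass
print(ans)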
['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']
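
The same string-only approach can also be written without exception handling. The sketch below assumes a hypothetical helper name, tail_after_prefix, and uses str.rpartition so that, like rsplit(prefix, 1), it splits on the last occurrence of the prefix.

```python
def tail_after_prefix(url, prefix, suffix=".html"):
    """Return the text after the last `prefix`, cut just past `suffix`, or None."""
    _, found, rest = url.rpartition(prefix)
    if not found:
        return None  # prefix is absent from this URL
    end = rest.find(suffix)
    if end == -1:
        return None  # no ".html" after the prefix
    return rest[:end + len(suffix)]


prefix = "www.example.com/products/"
urls = [
    "//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1",
    "//www.example.com/products/site/ammaxxllx.html",
    "https://www.example.com/site/kakzja.html",
]

# Drop the URLs for which the helper found nothing.
tails = [t for t in (tail_after_prefix(u, prefix) for u in urls) if t is not None]
print(tails)
```

Returning None instead of raising keeps the "no match" case explicit, so the caller decides whether to skip or report such URLs.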

Answer 3

Another option is to use urlparse (urllib.parse in Python 3) instead of, or together with, re.

It lets you split a URL like this:

from urllib.parse import urlsplit  # Python 2: import urlparse, then urlparse.urlsplit

my_url = "http://www.example.com/products/ammaxxllx.html?spam=eggs#sometag"
url_obj = urlsplit(my_url)

url_obj.scheme    # 'http'
url_obj.netloc    # 'www.example.com'
url_obj.path      # '/products/ammaxxllx.html'
url_obj.query     # 'spam=eggs'
url_obj.fragment  # 'sometag'

# Now you're able to work with every chunk as wanted!
prefix = '/products'
if url_obj.path.startswith(prefix):
    # Do whatever you need, replacing the initial characters. You can use re here
    print(url_obj.path[len(prefix) + 1:])
# prints: ammaxxllx.html
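
Applied to the URLs from the question, the same splitting can be wrapped in a small function. This is a sketch for Python 3, where the module is named urllib.parse; the name product_path and its host and prefix defaults are assumptions made for illustration. urlsplit also handles scheme-relative URLs that begin with //, and the query string lands in a separate field, so it never appears in the path.

```python
from urllib.parse import urlsplit


def product_path(url, host="www.example.com", prefix="/products/"):
    """Return the path after `prefix` for URLs on `host`, or None otherwise."""
    parts = urlsplit(url)
    if parts.netloc != host or not parts.path.startswith(prefix):
        return None
    # parts.path excludes the query string and the fragment.
    return parts.path[len(prefix):]


urls = [
    "//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1",
    "//www.example.com/products/site/ammaxxllx.html",
    "https://www.example.com/site/kakzja.html",
]

results = [product_path(u) for u in urls]
print(results)
```

Unlike the regex and string versions, this one does not require the path to end in .html; add that check if other file types should be excluded.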

3 answers

Answer 1  1  2018-09-28 15:14:35
Answer 2  0  2018-09-28 15:30:24
Answer 3  0  2018-09-28 16:20:10
