
How to scrape text in an href with Beautiful Soup?

I have an href in the format <a href="javascript:ShowImg('../UploadFile/Images/c/1/B_27902.jpg');"> , and I want to extract the URL '../UploadFile/Images/c/1/B_27902.jpg' from it. I used a clumsy way to get it :( Is there an easier way?

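from bs4 import BeautifulSoup
# url is the <a> tag, parsed with BeautifulSoup so that .get('href') works
url = BeautifulSoup("""<a href="javascript:ShowImg('../UploadFile/Images/c/1/B_27902.jpg');">""", "html.parser").a
html = url.get('href')
# strip the wrapper, including the quotes around the path
html = html.replace("javascript:ShowImg('", '').replace("');", '')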

The original tag is as follows:

<a href="javascript:ShowImg('../UploadFile/Images/c/1/B_27902.jpg');">
<img height="110" onerror="this.src='../UploadFile/Images/no_pic_big.jpg';"
src="../UploadFile/Images/c/1/S_27902.jpg" width="170"/>
</a>

BeautifulSoup can apply a compiled regular expression pattern to attribute values when searching for elements. You can then use the same pattern to extract the desired part of the matched value:

import re
from bs4 import BeautifulSoup

data = """
<a href="javascript:ShowImg('../UploadFile/Images/c/1/B_27902.jpg');">
<img height="110" onerror="this.src='../UploadFile/Images/no_pic_big.jpg';"
src="../UploadFile/Images/c/1/S_27902.jpg" width="170"/>
</a>
"""

soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"javascript:ShowImg\('(.*?)'\);")

href = soup.find('a', href=pattern)["href"]
link = pattern.search(href).group(1)
print(link)  # prints ../UploadFile/Images/c/1/B_27902.jpg
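
If a page contains several such links, or some href values do not follow the ShowImg(...) form, the extraction step can be wrapped in a small helper that returns None instead of raising an error. The following is only a sketch: the name showimg_path and the optional base_url parameter are hypothetical and not part of the original answer, and it reuses the same pattern on plain href strings.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the answer above.
SHOWIMG = re.compile(r"javascript:ShowImg\('(.*?)'\);")


def showimg_path(href, base_url=None):
    """Return the path inside ShowImg('...'), or None if href does not match.

    base_url is a hypothetical page address; when given, the relative
    path is resolved against it with urljoin.
    """
    match = SHOWIMG.search(href or "")  # tolerate a missing href (None)
    if match is None:
        return None
    path = match.group(1)
    return urljoin(base_url, path) if base_url else path


hrefs = [
    "javascript:ShowImg('../UploadFile/Images/c/1/B_27902.jpg');",
    "index.html",  # an ordinary link that should be skipped
]
paths = [showimg_path(h) for h in hrefs]
print(paths)
```

With a parsed document, the helper could be applied to tag.get("href") for every tag returned by soup.find_all("a"), keeping only the results that are not None.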

