
How to replace URLs in HTML Document using BeautifulSoup

I'm trying to convert all absolute URLs in an HTML document into relative links, using BeautifulSoup. For example, I'd like code that transforms this HTML tag:

<a href="https://www.mertens-stahl.de/berlin/unternehmen.php">

into this:

<a href="/berlin/unternehmen.php">

I haven't come across a solution that works, so my code sample looks like this so far:

import requests
from bs4 import BeautifulSoup

url = "https://www.mertens-stahl.de"
html = requests.get("https://www.mertens-stahl.de/berlin/downloads.php").text
soup = BeautifulSoup(html, "html.parser")
soup.find(url).replace_with("")

This raises AttributeError: 'NoneType' object has no attribute 'replace_with', so I'm looking for a proper way to solve this. Thanks!

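soup.find(url) looks for a tag whose name is the URL string, finds nothing, and returns None, which is why replace_with fails. This should do the trick instead: select every link whose href starts with the site URL and keep only its path.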

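from urllib.parse import urlparse

# Select every <a> whose href starts with the site's absolute URL.
links = soup.select('a[href^="https://www.mertens-stahl.de"]')
for link in links:
    # Keep only the path component, e.g. "/berlin/unternehmen.php".
    link['href'] = urlparse(link['href']).path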
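
Note that .path on its own drops any query string or fragment, so a link like /a.php?x=1#top would become /a.php. If that matters, here is a rough self-contained sketch of one way to keep them. The names make_links_relative and BASE_URL are made up for illustration, and it compares host names instead of string prefixes, which is a design choice of this sketch and not part of the snippet above.

```python
from urllib.parse import urlsplit, urlunsplit

from bs4 import BeautifulSoup

# Assumed site whose absolute links should become relative (taken from the question).
BASE_URL = "https://www.mertens-stahl.de"


def make_links_relative(html, base_url=BASE_URL):
    """Return html with absolute <a> links to base_url's host made relative."""
    soup = BeautifulSoup(html, "html.parser")
    base_host = urlsplit(base_url).netloc
    for link in soup.find_all("a", href=True):
        parts = urlsplit(link["href"])
        # Skip links that are already relative or point to another host.
        if parts.netloc != base_host:
            continue
        # Drop scheme and host; keep path, query string and fragment.
        # A bare site URL has an empty path, so fall back to "/".
        link["href"] = urlunsplit(
            ("", "", parts.path or "/", parts.query, parts.fragment)
        )
    return str(soup)
```

Calling make_links_relative(html) with the html string from the question should return the rewritten markup as a string. Links to other hosts are left as they are, and both http and https links to the same host are rewritten.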

