Python regex: search for HTML tags and modify them

Question

I have an HTML file as a string and want to change every <img src="http:.../../filename.png ..> to <img src="id:filename.png>. How can I do this with a regex?

I have this so far:

urls = re.findall(r'src=[\'"]?([^\'" >]+)', html)
allUrls = ', '.join(urls)

Answer 1

If you control the HTML, then regular expressions are fine.

Python:

import re
# [^"] keeps each match inside one src value, even with several tags on a line.
html = re.sub(r'(<img src=")[^"]*/([^"/]+">)', r'\1id:\2', html)

HTML:

<img src="http://example.com/filename1.jpg">
<img src="http://example.com/filename2.jpg">
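
To see the substitution in action, here is a minimal sketch that wraps it in a helper and runs it on the two tags above. The helper name `rewrite_img_src` and the constant `IMG_SRC` are made up for illustration. The pattern uses negated character classes (`[^"]`) rather than `.`, so a match cannot run past the closing quote when several tags share a line. It still assumes every tag looks exactly like `<img src="...">`.

```python
import re

# Hypothetical names for illustration only.
# Group 1: the tag opening; group 2: the last path segment plus the closing '">'.
IMG_SRC = re.compile(r'(<img src=")[^"]*/([^"/]+">)')

def rewrite_img_src(html: str) -> str:
    """Replace each quoted img src URL with "id:" plus its last path segment."""
    return IMG_SRC.sub(r'\1id:\2', html)

sample = (
    '<img src="http://example.com/filename1.jpg">\n'
    '<img src="http://example.com/filename2.jpg">'
)
print(rewrite_img_src(sample))
# <img src="id:filename1.jpg">
# <img src="id:filename2.jpg">
```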

Otherwise, a regular expression would get extremely messy. I suggest lxml. BeautifulSoup is also great.

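import posixpath
from urllib.parse import urlparse  # Python 3; the module was named urlparse in Python 2

import lxml.etree

root = lxml.etree.HTML(html)
for img in root.iter("img"):
    src = img.get("src")
    if src is not None:
        parsed = urlparse(src)
        # Only rewrite absolute http(s) URLs; data: URIs and the like are left alone.
        if parsed.scheme in ("http", "https"):
            # Take the basename of the URL path, not of the whole URL,
            # so a query string or fragment does not end up in the name.
            img.set("src", "id:" + posixpath.basename(parsed.path))
# Serialize as HTML text rather than XML bytes.
html = lxml.etree.tostring(root, method="html", encoding="unicode")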

This copes with many cases that would be awkward if not impossible with regular expressions. Examples:

<img src=http://example.com/filename.jpg>

<img src=http%3A%2F%2Fexample.com%2Ffilename.jpg>

<img title="src=http://example.com/bait.jpg" src=http://example.com/filename.jpg>

<img src=data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==>

<img
src
= 
http://example.com/filename.jpg
>

<img src="http://example.com/book report cover.jpg"> <!-- invalid but common -->
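
The per-attribute decision made inside the lxml loop can also be checked on its own, without a parser, against some of the `src` values listed above. This is a minimal Python 3 sketch using only the standard library. The helper name `rewrite_src` is hypothetical, and taking the basename of the parsed URL path (rather than of the whole URL) is an assumption about the intended behaviour.

```python
import posixpath
from urllib.parse import urlparse

def rewrite_src(src: str) -> str:
    """Hypothetical helper: map an http(s) URL to "id:<filename>"; return anything else unchanged."""
    parsed = urlparse(src)
    if parsed.scheme in ("http", "https"):
        # Use the path component only, so "?query" and "#fragment" are dropped.
        return "id:" + posixpath.basename(parsed.path)
    return src

examples = [
    "http://example.com/filename.jpg",
    "http://example.com/book report cover.jpg",
    "data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==",
]
for src in examples:
    print(repr(src), "->", repr(rewrite_src(src)))
```

The `data:` URI comes back untouched because its scheme is neither `http` nor `https`, which is the kind of distinction that is hard to express in a single regular expression.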
