
How can I remove only inline images with a regular expression?

I have a lot of user generated content that has inline images in it, in this style:

<img src="data:image/gif;base64,R0lGODlhEAAOALMAAOazToeHh0tLS/7LZv/0jvb2
/ge8WSLf/rhf/3kdbW1mxsbP//mf///yH5BAAAAAAALAAAAAAQAA4AAARe8L1hwLJoExKcpp
V0aCcGCmTIHEIUEqjgaORCMxIC6e0CcguWw6aFjsVMkkIr7g77ZKPJjd7sJAgVGoEgAwXEQA7" 
width="16" height="14" alt="embedded folder icon">

Some of the images are gif, some are png, but it is possible they are of other image types, too.

I'd like to be able to remove inline images like this with PHP. I'm guessing that the way to do it would be a regular expression, even though overuse of regex seems to be frowned on in the Stack Overflow community as it is often used in place of other tools that work better and are designed for a specific purpose.

However, for this scenario I could think of no tool that might do the job other than strip_tags, and that would not work because I do want to keep images that are not encoded inline.

So, how can I use a regular expression to filter out just inline images with PHP? Or, if there is a better tool to do this, what is it?

A regular expression sounds fine to me. Just have it match data:image/gif and other types of images you want to remove if and only if they occur within an img tag.

Here's a starting point, expand it to your liking:

<img[^>]* src=['"]?data:image/gif[^>]*>

Make sure to run it with the ignore case flag and test the hell out of it before you put it live.
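As a rough sketch of how that starting point might be wired into PHP: the pattern below is a variation on the one above, widened to any data:image/ subtype (gif, png, and so on) and tightened so that attributes such as data-src are not matched. The function name strip_inline_images_regex and the sample markup are made up for illustration. It still assumes no attribute value contains a literal > character.

```php
<?php
// Hypothetical helper; the name is not from the original answer.
function strip_inline_images_regex(string $html): string
{
    // <img ...> tags whose src attribute starts with data:image/.
    // \s before "src" keeps attributes like data-src from matching.
    // The i flag makes the match case-insensitive; [^>] also spans newlines,
    // so base64 payloads wrapped over several lines are covered.
    $pattern = '~<img\b[^>]*\ssrc\s*=\s*[\'"]?\s*data:image/[^>]*>~i';

    // preg_replace() returns null on failure; fall back to the input then.
    return preg_replace($pattern, '', $html) ?? $html;
}

$html = '<p>Icon: <img src="data:image/png;base64,iVBORw0KGgo=" alt="x"> and '
      . '<img src="/photo.jpg" alt="kept"></p>';

$clean = strip_inline_images_regex($html);
echo $clean, "\n";
```

Regular images such as /photo.jpg are left alone because their src does not begin with data:image/.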

Regexes aren't frowned on in general. They're a tool like any other in the PHP toolbox. The problems start coming once you're using regexes to parse HTML. For small "known format" snippets, you can get away with it. But as a general HTML manipulation tool, regexes simply can NOT guarantee you'll get good results, as HTML is not a regular language.

As with most HTML manipulations, use DOM:

$dom = new DOMDocument;
$dom->loadHTML(...);
$xp = new DOMXPath($dom);

$images = $xp->query("//img[starts-with(@src,'data:image')]");

foreach($images as $img) {
    $img->parentNode->removeChild($img);
}
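
For a fragment of user content rather than a whole page, the same idea could be wrapped up roughly as follows. This is only a sketch: the function name remove_inline_images_dom, the wrapper div, and the sample markup are assumptions for illustration. Note that XPath starts-with() is case-sensitive, and that loadHTML() treats input as ISO-8859-1 unless told otherwise, so non-ASCII content would need extra handling.

```php
<?php
// Hypothetical helper; the name is not from the original answer.
function remove_inline_images_dom(string $html): string
{
    // Collect libxml warnings instead of printing them for sloppy user HTML.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Wrap the fragment so it can be located again after parsing.
    $dom->loadHTML('<div>' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    // Remove every <img> whose src starts with "data:image".
    foreach ($xp->query("//img[starts-with(@src, 'data:image')]") as $img) {
        $img->parentNode->removeChild($img);
    }

    // loadHTML() adds <html><body>; serialize only the wrapper's children.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html = '<p>Icon: <img src="data:image/gif;base64,R0lGODlhEAAOALMAAOazToeHh0tLS" alt="x"> and '
      . '<img src="/photo.jpg" alt="kept"></p>';

$clean = remove_inline_images_dom($html);
echo $clean, "\n";
```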


 