
How can I remove only inline images with a regular expression?

I have a lot of user-generated content that has inline images in it, in this style:

<img src="data:image/gif;base64,R0lGODlhEAAOALMAAOazToeHh0tLS/7LZv/0jvb2
/ge8WSLf/rhf/3kdbW1mxsbP//mf///yH5BAAAAAAALAAAAAAQAA4AAARe8L1hwLJoExKcpp
V0aCcGCmTIHEIUEqjgaORCMxIC6e0CcguWw6aFjsVMkkIr7g77ZKPJjd7sJAgVGoEgAwXEQA7" 
width="16" height="14" alt="embedded folder icon">

Some of the images are GIFs and some are PNGs, but other image types are possible, too.

I'd like to be able to remove inline images like this with PHP. I'm guessing the way to do it is a regular expression, even though overuse of regex seems to be frowned on in the Stack Overflow community, since it is often used in place of tools that work better and are designed for a specific purpose.

However, for this scenario I can think of no tool that might do the job other than strip_tags, and I do want to keep images that are not encoded inline.

So, how can I use a regular expression to filter out just the inline images with PHP? Or, if there is a better tool for this, what is it?

A regular expression sounds fine to me. Just have it match data:image/gif, and any other image types you want to remove, if and only if they occur within an img tag.

Here's a starting point; expand it to your liking:

<img[^>]* src=['"]?data:image/gif[^>]*>

Make sure to run it with the ignore-case flag and test the hell out of it before you put it live.
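To make that concrete, here is a minimal sketch of how the pattern might be applied with preg_replace. The function name strip_inline_images is hypothetical, and widening data:image/gif to any data:image/ subtype is an assumption based on the question's remark that other image types may occur. It relies on the tag containing no ">" inside an attribute value, which holds for base64 data but not necessarily for arbitrary alt text.

```php
<?php
// Hypothetical helper, not part of the original answer.
function strip_inline_images(string $html): string
{
    // Match an <img> tag whose src attribute starts with "data:image/".
    // The trailing "i" is the ignore-case flag; [^>] also matches newlines,
    // so base64 data wrapped over several lines is still covered.
    $pattern = '~<img\b[^>]*\ssrc\s*=\s*[\'"]?\s*data:image/[^>]*>~i';

    // preg_replace returns null on failure; fall back to the input unchanged.
    return preg_replace($pattern, '', $html) ?? $html;
}

$html = '<p>a<img src="data:image/gif;base64,R0lGOD" alt="x">b'
      . '<img src="/logo.png" alt="y"></p>';

// The data: image is dropped; the regular image is kept.
echo strip_inline_images($html), "\n";
```

Images referenced by URL are left alone because the pattern only fires when the src value begins with data:image/.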

Regexes aren't frowned on in general. They're a tool like any other in the PHP toolbox. The problems start once you use regexes to parse HTML. For small "known format" snippets, you can get away with it. But as a general HTML manipulation tool, regexes simply can NOT guarantee good results, as HTML is not a regular language.

As with most HTML manipulations, use DOM:

$dom = new DOMDocument;
$dom->loadHTML(...);
$xp = new DOMXPath($dom);

$images = $xp->query("//img[starts-with(@src,'data:image')]");

foreach($images as $img) {
    $img->parentNode->removeChild($img);
}
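For an HTML fragment rather than a whole page, the same idea could be wrapped up roughly as follows. This is only a sketch under some assumptions: the function name remove_inline_images is made up, the input is assumed to be a UTF-8 fragment (hence the wrapper document with a charset meta tag), and libxml warnings about sloppy markup are suppressed instead of handled.

```php
<?php
// Hypothetical helper built around the DOM calls shown above.
function remove_inline_images(string $html): string
{
    $dom = new DOMDocument();

    // User-generated markup is rarely valid; collect libxml warnings
    // internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: the fragment is UTF-8, so wrap it in a document that says so.
    $dom->loadHTML(
        '<html><head><meta http-equiv="Content-Type" '
        . 'content="text/html; charset=utf-8"></head><body>'
        . $html . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every <img> whose src starts with "data:image" and detach it.
    $xp = new DOMXPath($dom);
    foreach ($xp->query("//img[starts-with(@src,'data:image')]") as $img) {
        $img->parentNode->removeChild($img);
    }

    // Serialize only the children of <body> to get the fragment back.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html = '<p>a<img src="data:image/gif;base64,R0lGOD" alt="x">b'
      . '<img src="/logo.png" alt="y"></p>';
echo remove_inline_images($html), "\n";
```

Note that XPath's starts-with is case-sensitive, so a src written as DATA:image/... would slip through this query as written.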
