Update src value using preg_replace

I have some <img> tags like these:

<img alt="" src="{assets_8170:{filedir_14}test.png}" style="width: 700px; height: 181px;" />
<img src="{filedir_14}test.png" alt="" />

I need to update the src value by extracting the filename and placing it inside a WordPress shortcode:

<img src="[my-shortcode file='test.png']" ... />

The regex to extract the filename is this one: [a-zA-Z_0-9-()]+\.[a-zA-Z]{2,4} , but I am not able to write the complete regex, because the image tag attributes do not appear in the same order in every instance.

PHP - Parsing HTML content, transforming it and returning the resulting HTML

This answer grew over time as it tried to address the issue.

Several attempts were made, but the latest one (loadXML/saveXML) nailed it.

DOMDocument - loadHTML and saveHTML

If you need to parse an HTML string in PHP so that you can later fetch and modify its content in a structured and safe manner, without breaking the encoding, you can use DOMDocument::loadHTML():

https://www.php.net/manual/en/domdocument.loadhtml.php

Here I show how to parse your HTML string, fetch all of its <img> elements and, for each of them, read the src attribute and set it to an arbitrary value.

Finally, to get back the HTML string of the transformed document, you can use DOMDocument::saveHTML():

https://www.php.net/manual/en/domdocument.savehtml

Bear in mind that by default the document will contain the basic HTML skeleton wrapped around your original content. To make sure the returned HTML is limited to your content only, here I show how to fetch the body element and loop through its children to build the final string:

https://onlinephp.io/c/157de

<?php

$html = "
<img alt=\"\" src=\"{assets_8170:{filedir_14}test.png}\" style=\"width: 700px; height: 181px;\" />
<img src=\"{filedir_14}test.png\" alt=\"\" />
";

$transformed = processImages($html);

echo $transformed;

function processImages($html){

    //parse the html fragment
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    
    //fetch the <img> elements
    $images = $dom->getElementsByTagName('img');
    
    //for each <img>
    foreach ($images as $img) {
        //get the src attribute
        $src = $img->getAttribute('src');
        //set the src attribute
        $img->setAttribute('src', 'bogus');
    }
    
    //return the html modified so far (body content only)
    $body = $dom->getElementsByTagName('body')->item(0);
    $bodyChildren = $body->childNodes;
    $bodyContent = '';
    foreach ($bodyChildren as $child) {
        $bodyContent .= $dom->saveHTML($child);
    }
    return $bodyContent;
}

Problems with src attribute value restrictions

After you pointed out in the comments that saveHTML was returning HTML in which the special characters of the image src attribute value were escaped, I did some more research...

This happens because DOMDocument wants to make sure that the src attribute contains a valid URL, and { and } are not valid URL characters, so saveHTML percent-encodes them as %7B and %7D.

Evidence that it doesn't happen with custom data attributes

For example, if I add an attribute like data-test="mycustomcontent: {wildlyusingwhatever}" it is returned untouched, because its value does not have to adhere to any strict rules.
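The difference described above can be checked with a minimal sketch like the one below. The markup and the variable name $serialized are only illustrative; the exact escaping of src may depend on the libxml2 version PHP was built against, so the output is best inspected rather than assumed.

```php
<?php

// Illustrative fragment: one URL attribute (src) and one custom data attribute,
// both containing curly braces.
$fragment = '<img src="{filedir_14}test.png" data-test="mycustomcontent: {wildlyusingwhatever}">';

$dom = new DOMDocument();
$dom->loadHTML($fragment);

// Serialize only the <img> element, not the implied <html>/<body> wrapper.
$img = $dom->getElementsByTagName('img')->item(0);
$serialized = $dom->saveHTML($img);

// Compare how the braces in src and in data-test come back.
echo $serialized, "\n";
```

According to the behaviour reported in this answer, the braces in src are expected to come back percent-encoded while the data-test value stays as written.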

Quick fix to make it work (defeating the parser as a whole)

The only fix I could come up with so far was this:

https://onlinephp.io/c/0e334

//VERY UNSAFE -- in $bodyContent, replace %7B with { and %7D with }
$bodyContent = str_replace("%7B", "{", $bodyContent);
$bodyContent = str_replace("%7D", "}", $bodyContent);
return $bodyContent;

But of course it is neither safe nor smart, and not a very good solution. First, it defeats the whole purpose of using a parser instead of regex; second, it could seriously damage the result, because any %7B or %7D that legitimately appears elsewhere in the content would be rewritten as well.

A better approach using loadXML and saveXML

To keep the HTML rules from kicking in, you can parse the text as XML instead of HTML. The parser still handles the nested markup syntax (difficult or impossible to deal with using regex), but it does not apply the HTML restrictions on attribute contents. Note that this only works if the fragment is well-formed XML, as the self-closed <img ... /> tags in the question are.

I modified the core logic like this:

//loads the html content as xml, wrapping it in a root element
$dom->loadXML("<root>{$html}</root>");

//...

//returns the xml content of each child of <root> as processed so far
$rootNode = $dom->documentElement;
$children = $rootNode->childNodes;
$content = '';
foreach ($children as $child) {
   $content .= $dom->saveXML($child);
}
    
return $content;

And this is the working demo: https://onlinephp.io/c/f9de1
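To connect this back to the original question, here is a minimal sketch that combines the loadXML/saveXML approach with the filename regex from the question to build the shortcode. The function name srcToShortcode is hypothetical, the hyphen inside the character class is escaped only for readability, and the sketch assumes the input is well-formed XML.

```php
<?php

// Hypothetical helper (not part of the original answer): rewrites the src of
// every <img> into [my-shortcode file='<filename>'].
function srcToShortcode(string $html): string
{
    $dom = new DOMDocument();
    // Parse as XML inside a temporary root; the fragment must be well-formed.
    $dom->loadXML("<root>{$html}</root>");

    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Filename pattern from the question; leave src alone if nothing matches.
        if (preg_match('/[a-zA-Z_0-9\-()]+\.[a-zA-Z]{2,4}/', $src, $match)) {
            $img->setAttribute('src', "[my-shortcode file='{$match[0]}']");
        }
    }

    // Serialize only the children of the temporary root element.
    $content = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $content .= $dom->saveXML($child);
    }
    return $content;
}

$html = '<img alt="" src="{assets_8170:{filedir_14}test.png}" style="width: 700px; height: 181px;" />' . "\n"
      . '<img src="{filedir_14}test.png" alt="" />';

echo srcToShortcode($html), "\n";
```

Because the regex only ever runs against a single attribute value, the order of the attributes in each tag no longer matters.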
