Remove all HTML attributes except src

Question

I am trying to remove all tag attributes except for the src attribute. For example:

<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>

should be returned as:

<p>This is a paragraph with an image <img src="/path/to/image.jpg" /></p>

I have a regex that strips all attributes, but I am trying to adapt it so that it keeps src. Here is what I have so far:

<?php preg_replace('/<([A-Z][A-Z0-9]*)(\b[^>]*)>/i', '<$1>', '<html><goes><here>');

Answer 1

This might work for your needs:

$text = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i",'<$1$2$3>', $text);

// <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>

RegExp breakdown:

/              # Start Pattern
 <             # Match '<' at beginning of tags
 (             # Start Capture Group $1 - Tag Name
  [a-z]         # Match 'a' through 'z'
  [a-z0-9]*     # Match 'a' through 'z' or '0' through '9' zero or more times
 )             # End Capture Group
 (?:           # Start Non-Capture Group
  [^>]*         # Match anything other than '>', Zero or More Times
  (             # Start Capture Group $2 - ' src="...."'
   \s            # Match one whitespace
   src=          # Match 'src='
   ['"]          # Match ' or "
   [^'"]*        # Match anything other than ' or " 
   ['"]          # Match ' or "
  )             # End Capture Group 2
 )?            # End Non-Capture Group, match group zero or one time
 [^>]*?        # Match anything other than '>', Zero or More times, not-greedy (won't eat the /)
 (\/?)         # Capture Group $3 - '/' if it is there
 >             # Match '>'
/i            # End Pattern - Case Insensitive

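Add some quoting, use the replacement text <$1$2$3>, and it should strip any non-src= attributes from well-formed HTML tags.

Please note that this will not necessarily work on all input, as the anti-HTML-with-RegExp people point out below. There are some pitfalls, most notably <p style=">"> would end up as <p>">, along with a few other breakages. I would recommend looking at Zend_Filter_StripTags as a more robust tag/attribute filter in PHP.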
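To make the caveat concrete, here is a small sketch that runs the same pattern and replacement against two awkward inputs. The variable names ($pattern, $broken, $unquoted) and the second input are illustrative assumptions, not part of the original answer.

```php
<?php
// Same pattern and replacement text as in the answer above.
$pattern = "/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i";

// A '>' inside a quoted attribute value ends the match too early,
// so the rest of the attribute value leaks into the text.
$broken = preg_replace($pattern, '<$1$2$3>', '<p style=">">text</p>');
echo $broken, PHP_EOL; // <p>">text</p>

// An unquoted src value is not recognised by the src sub-pattern,
// so it is stripped together with the other attributes.
$unquoted = preg_replace($pattern, '<$1$2$3>', '<img src=a.jpg width=5>');
echo $unquoted, PHP_EOL; // <img>
```

Both results follow from the pattern treating the first '>' as the end of the tag and requiring quotes around the src value.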

Answer 2

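You usually should not parse HTML with regular expressions.

Instead, you should call DOMDocument::loadHTML.
You can then recurse through the elements in the document and call removeAttribute.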
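A minimal sketch of that approach might look like the following. The function name stripAttributesExcept and its $keep parameter are hypothetical, and the LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD flags are an assumption made so that no html/body wrapper or doctype is added to the fragment.

```php
<?php
// Hypothetical helper: removes every attribute whose name is not in $keep
// from $node and from all of its descendant elements.
function stripAttributesExcept(DOMNode $node, array $keep = ['src']): void
{
    if ($node instanceof DOMElement) {
        // Collect the names first; removing attributes while iterating
        // the live attribute map could skip entries.
        $names = [];
        foreach ($node->attributes as $attr) {
            $names[] = $attr->nodeName;
        }
        foreach ($names as $name) {
            if (!in_array(strtolower($name), $keep, true)) {
                $node->removeAttribute($name);
            }
        }
    }
    // Recurse into children (text nodes have none and are skipped).
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            stripAttributesExcept($child, $keep);
        }
    }
}

$html = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
stripAttributesExcept($dom);
$result = trim($dom->saveHTML());
echo $result, PHP_EOL;
```

Passing a different $keep array, for example ['src', 'href'], would preserve more attributes without changing the traversal.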

Answer 3

OK, here is what I used, and it seems to be working well:

<([A-Z][A-Z0-9]*)(\b[^>src]*)(src\=[\'|"|\s]?[^\'][^"][^\s]*[\'|"|\s]?)?(\b[^>]*)>

Feel free to poke any holes in it.

Answer 4

Unfortunately, I am not sure how to answer this question for PHP. If I were using Perl, I would do the following:

use strict;
my $data = q^<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>^;

$data =~ s{
    <([^/> ]+)(\s[^>]*)> # split into tag name and attribs; attribs must start with whitespace
}{
    my $attribs = $2;
    my @parts = split( /\s+/, $attribs ); # separate by whitespace
    @parts = grep { m/^src=/i } @parts;   # retain just src attributes
    if ( @parts ) {
        "<" . join( " ", $1, @parts ) . ">";
    } else {
        "<" . $1 . ">";
    }
}xseg;

print( $data );

Returns:

<p>This is a paragraph with an image <img src="/path/to/image.jpg"></p>

Answer 5

Posting to provide a solution for Oracle Regex:

<([^!][a-z][a-z0-9]*)([^>]*(\ssrc=[''''\"][^''''\"]*[''''\"]))?[^>]*?(\/?)>

Answer 6

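Do not use regex to parse valid HTML. Use regex to parse an HTML document only if all available DOM parsers have failed you. I super-love regex, but regex is "DOM-ignorant" and it will quietly fail and/or mutate your document.

I generally prefer a mix of DOMDocument and XPath to target document entities concisely, directly, and intuitively.

With only a couple of minor exceptions, the XPath expression closely resembles its logic in plain English.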

//@*[not(name()="src")]

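at any level in the document ( // )
find any attribute ( @* )
that satisfies these requirements ( [] )
that is not ( not() )
named "src" ( name()="src" )

This is more readable, attractive, and maintainable.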

Code:

$html = <<<HTML
<p id="paragraph" class="green">
    This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/>
</p>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//@*[not(name()="src")]') as $attr) {
    $attr->parentNode->removeAttribute($attr->nodeName);
}
echo $dom->saveHTML();

Output:

<p>
    This is a paragraph with an image <img src="/path/to/image.jpg">
</p>

If you want to add another exempt attribute, you can use or:

//@*[not(name()="src" or name()="href")]
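If the list of exempt attributes changes often, the or-chain can be generated rather than hand-written. The following is only a sketch: the function name keepOnlyAttributes and its parameters are hypothetical, and it assumes the names in $keep are plain attribute names that contain no quote characters.

```php
<?php
// Hypothetical helper: builds the XPath predicate from a list of attribute
// names to keep, then removes every other attribute from the document.
function keepOnlyAttributes(string $html, array $keep): string
{
    if ($keep === []) {
        // Nothing is exempt: select every attribute.
        $query = '//@*';
    } else {
        $conditions = array_map(
            function (string $name): string {
                return 'name()="' . $name . '"';
            },
            $keep
        );
        // e.g. //@*[not(name()="src" or name()="href")]
        $query = '//@*[not(' . implode(' or ', $conditions) . ')]';
    }

    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query($query) as $attr) {
        $attr->parentNode->removeAttribute($attr->nodeName);
    }
    return $dom->saveHTML();
}

$out = keepOnlyAttributes(
    '<p class="x"><a href="/a" title="t"><img src="/i.jpg" alt="i"></a></p>',
    ['src', 'href']
);
echo $out;
```

Here the src and href attributes survive, while class, title, and alt are removed.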

Answer 7

As stated above, you should not use regex to parse HTML or XML.

I would handle your example with str_replace(), provided the markup is the same every time.

$str = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

$str = str_replace('id="paragraph" class="green"', "", $str);

$str = str_replace('width="50" height="75"',"",$str);

Answer scores and timestamps

Answer 1: score 21, 2010-06-08 21:52:53
Answer 2: score 8, 2010-06-08 02:34:55
Answer 3: score 1, accepted, 2010-06-08 21:32:41
Answer 4: score 0, 2010-06-08 08:40:59
Answer 5: score 0, 2015-06-17 04:37:09
Answer 6: score 0, 2021-01-15 22:24:24
Answer 7: score -1, 2010-06-08 22:28:54
