PHP: strip_tags - remove only certain tags (and their contents)?

I use the strip_tags() function, but I need to remove some tags (and all of their contents).

For example:

<div>
  <p class="test">
    Test A
  </p>
  <span>
    Test B
  </span>
  <div>
    Test C
  </div>
</div>

Let's say I need to get rid of the P and SPAN tags and keep only:

<div>
  <div>
    Test C
  </div>
</div>

strip_tags expects as its second parameter the tags that you want to KEEP.
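
As a quick illustration of that allow-list behaviour, here is a minimal sketch (the sample string is made up for this example). Note that strip_tags() drops only the tags themselves; the text inside a stripped tag stays in the output:

```php
<?php
$html = '<div><p class="test">Test A</p><span>Test B</span><div>Test C</div></div>';

// The second argument lists the tags to KEEP; all other tags are stripped,
// but their text content is left in place.
$result = strip_tags($html, '<div>');

echo $result; // <div>Test ATest B<div>Test C</div></div>
```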

In this particular example I could use strip_tags($html, "<div>"); but the HTML I'm scraping and the tags that need to be removed are different all the time.

I searched for hours for a function that suits my needs, but couldn't find anything useful.

Any ideas?

Use a regular expression. Something like this should work:

$tags = array('p', 'span');
// \1 is a backreference to the tag name captured by the first group.
$text = preg_replace('#<(' . implode('|', $tags) . ')>.*?</\1>#s', '', $text);

The demo shows the desired tags being replaced with nothing.

Note that you may need to tweak it further, say, to compensate for whitespace within the tags, or for other unknowns that your example does not demonstrate.

Here is the regex to use to capture tags with or without attributes:

'#<(' . implode('|', $tags) . ')(?:\s[^>]*)?>.*?</\1>#s'
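
To try the pattern on the markup from the question, it can be wrapped in a small function. This is only a sketch: the name remove_tags_with_content is hypothetical, and the preg_quote() escaping and the case-insensitive flag are additions that were not part of the original answer.

```php
<?php
// Hypothetical helper, not part of the original answer.
function remove_tags_with_content(string $html, array $tags): string
{
    if (!$tags) {
        return $html; // nothing to remove
    }

    // Escape each tag name so it is safe inside the pattern (delimiter is #).
    $names = array_map(function ($tag) {
        return preg_quote($tag, '#');
    }, $tags);

    // <tag ...> followed by the shortest run of text up to the next </tag>.
    // s: dot also matches newlines; i: tag names match case-insensitively.
    $pattern = '#<(' . implode('|', $names) . ')(?:\s[^>]*)?>.*?</\1>#si';

    return preg_replace($pattern, '', $html);
}

$html = <<<HTML
<div>
  <p class="test">
    Test A
  </p>
  <span>
    Test B
  </span>
  <div>
    Test C
  </div>
</div>
HTML;

$clean = remove_tags_with_content($html, array('p', 'span'));
echo $clean;
```

Because the match is lazy, it stops at the first closing tag it finds, so an element nested inside another element of the same name (a span inside a span, for instance) would not be removed cleanly. That is one of the cases where a DOM parser is the better tool.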

You say that you are using Simple HTML DOM (Good! That's the right way to parse HTML). When I need to remove a tag and its contents, I do:

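// $html is a Simple HTML DOM object, e.g. the result of str_get_html().
$rows = $html->find("span");

foreach ($rows as $row)
{
  // Blanking outertext removes the element together with its contents.
  $row->outertext = "";
}

// Serialize the document and parse it again so the removal sticks.
$html->load($html->save());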

The last line is required because the DOM gets confused after modifications are made, so the entire DOM has to be collapsed to a string and then parsed again for the changes to become permanent (IMO, a bug in Simple HTML DOM).

The Simple HTML DOM approach is safer and more stable than a regular expression.
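
If Simple HTML DOM is not available, the same idea can be sketched with PHP's bundled DOM extension instead. This swaps in a different library from the one used in the answer above; the function name remove_elements_dom is hypothetical, and the sketch assumes plain ASCII input (loadHTML() may need an explicit encoding hint for other text).

```php
<?php
// Hypothetical helper (not from the original answers): removes every element
// whose tag name is in $tags, together with its contents, using ext-dom.
function remove_elements_dom(string $html, array $tags): string
{
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $doc = new DOMDocument();
    // Wrap the fragment so there is always a <body> to read back from.
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($tags as $tag) {
        // getElementsByTagName() returns a live list, so copy it before removing nodes.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = '<div><p class="test">Test A</p><span>Test B</span><div>Test C</div></div>';
$clean = remove_elements_dom($html, array('p', 'span'));
echo $clean;
```

Unlike the regular expression, this handles nested elements of the same name, because whole nodes are removed from the parsed tree rather than matched as text.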
