PHP: How do I remove some tags when parsing an HTML page?

When parsing part of a web page (the <div> whose id is "parse-it"), I'd like to remove the <script> tags and also strip the 'href' attributes from the <a> tags inside it. Here is my code:

$url = 'http://example.com/';
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = '';
foreach ($xpath->evaluate('//*[starts-with(@id, "parse-it")]') as $childNode) {
    $result .= $dom->saveHtml($childNode);
}
echo $result;

Any suggestions? Thank you in advance.

Update: example document: https://jsfiddle.net/azt97tm4/

You can do it with str_replace.

http://php.net/manual/en/function.str-replace.php

foreach ($xpath->evaluate('//*[starts-with(@id, "parse-it")]') as $childNode) {
    $result .= $dom->saveHtml($childNode);
    $target = array("<script>", "www.example.com");
    $modify = array("", "google");
    $output = str_replace($target, $modify, $result);
}
echo $output;
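To see what this replacement actually does, here is a small self-contained check. The sample fragment is made up for illustration. Because str_replace only swaps literal strings, the opening <script> tag disappears but the script body and the closing tag stay in the output, and the href attribute is rewritten rather than removed.

```php
<?php
// Made-up fragment standing in for the HTML saved from the "parse-it" div.
$result = '<div id="parse-it"><script>alert(1);</script><a href="http://www.example.com/">x</a></div>';

$target = array("<script>", "www.example.com");
$modify = array("", "google");

// Each literal string in $target is replaced by the matching entry in $modify.
$output = str_replace($target, $modify, $result);

echo $output, "\n";
// prints: <div id="parse-it">alert(1);</script><a href="http://google/">x</a></div>
```

So plain string replacement is probably only enough when the exact strings to drop are known in advance; removing whole elements or attributes is easier on the DOM.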

Try this. If you run into any problems, let me know.

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Remove every <script> element inside the target div.
foreach ($xpath->query('//div[starts-with(@id, "parse-it")]//script') as $badScriptNode) {
    $badScriptNode->parentNode->removeChild($badScriptNode);
}

// Strip the href attribute from every link inside the target div.
foreach ($xpath->query('//div[starts-with(@id, "parse-it")]//a[@href]') as $badAnchorNode) {
    $badAnchorNode->removeAttribute("href");
}

// Prints the whole cleaned document, not just the target div.
echo $dom->saveHTML();
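The snippet above prints the entire document, while the question only wants the "parse-it" part. A minimal sketch that combines the two is shown here. The helper name cleanFragment and the sample markup are hypothetical, and libxml_use_internal_errors is used in place of the @ operator as one possible way to silence parser warnings.

```php
<?php
// Hypothetical helper (not from the original posts): returns the HTML of every
// element whose id starts with "parse-it", without <script> elements and
// without href attributes on links.
function cleanFragment(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $result = '';

    foreach ($xpath->query('//*[starts-with(@id, "parse-it")]') as $node) {
        // XPath results are a static list, so removing nodes while looping is safe.
        foreach ($xpath->query('.//script', $node) as $script) {
            $script->parentNode->removeChild($script);
        }
        foreach ($xpath->query('.//a[@href]', $node) as $anchor) {
            $anchor->removeAttribute('href');
        }
        // Serialize only this element, as in the question's code.
        $result .= $dom->saveHTML($node);
    }

    return $result;
}

// Made-up input for illustration.
$sample = '<p>outside</p><div id="parse-it"><script>alert(1);</script>'
        . '<a href="http://example.com/">link</a></div>';

echo cleanFragment($sample), "\n";
```

With the sample input, the paragraph outside the div is not part of the output, and the link text survives without its href.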

The following XSLT code removes all script elements and a/@href attributes from an XML document. I've used XSLT 1.0 here, because although XSLT 3.0 makes it a little shorter (and is available for PHP by installing the relevant Saxon library), XSLT 1.0 is still more widely used by PHP users.

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- default template copies everything unchanged -->

<xsl:template match="node()|@*">
  <xsl:copy>
    <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
</xsl:template>

<!-- drop script elements -->

<xsl:template match="script"/>

<!-- drop a/@href attributes -->

<xsl:template match="a/@href"/>

</xsl:transform>

Note that XSLT (like XPath) is defined to operate on XML rather than HTML, so you may need to do an initial conversion - I don't know the PHP world well enough to know the details. You may also need to make changes if the source document uses namespaces.
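On the PHP side, one possible way to run a stylesheet like this is the XSLTProcessor class from the xsl extension, fed with a DOMDocument that was loaded through loadHTML. The following is a sketch under those assumptions: the sample markup is made up, the xsl:output line is an addition so that the result is serialized as HTML, and it relies on DOMDocument::loadHTML producing elements in no namespace, which is why match="script" applies directly.

```php
<?php
// Requires the dom and xsl extensions. Sample markup is illustrative only.
$html = '<div id="parse-it"><script>alert(1);</script>'
      . '<a href="http://example.com/">link</a></div>';

// The stylesheet from above, plus an xsl:output element (an addition here).
$xsl = <<<'XSL'
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html"/>
<xsl:template match="node()|@*">
  <xsl:copy>
    <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
</xsl:template>
<xsl:template match="script"/>
<xsl:template match="a/@href"/>
</xsl:transform>
XSL;

// Parse the HTML leniently; this is the "initial conversion" into a DOM tree.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// The stylesheet itself is well-formed XML, so loadXML is used for it.
$style = new DOMDocument();
$style->loadXML($xsl);

$proc = new XSLTProcessor();
$proc->importStylesheet($style);

// transformToXml returns the serialized result as a string.
$out = $proc->transformToXml($doc);
echo $out;
```

The transform runs over the whole document here; narrowing the result to the "parse-it" div would still need an XPath step such as the one in the question.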
