I use a function to get the first "x" words of a string. The main part is:
preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
When a word is inside HTML markup, for example:
<a href="/"><u>Linktext</u></a>
The regex counts "Linktext" as a word. The regex should be changed to skip every word that is inside an HTML tag.
Is this possible?
Use XSL transformations. I used the template from a related answer (How to remove all text from an XML document). It drops every text node and empties every attribute value, so only the tag structure remains:
$string = '<a href="/">Some text <u>Linktext</u> more text</a>';
$xslTemplate = '<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
    <!-- copy all nodes -->
    <xsl:template match="node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <!-- clear attributes -->
    <xsl:template match="@*">
        <xsl:attribute name="{name()}" />
    </xsl:template>
    <!-- ignore text content of nodes -->
    <xsl:template match="text()" />
</xsl:stylesheet>';
libxml_use_internal_errors(true);
$inputDom = new DOMDocument();
$inputDom->loadHTML($string);
$xslDom = new DOMDocument();
$xslDom->loadXML($xslTemplate);
$cp = new XSLTProcessor();
// registerPHPFunctions() is not needed here: the stylesheet calls no PHP functions
$cp->importStylesheet($xslDom);
$transformedResult = $cp->transformToDoc($inputDom);
// loadHTML() wraps the fragment in <html><body>, so serialize only the body node
$transformedHtmlString = $transformedResult->saveXML($transformedResult->getElementsByTagName('body')->item(0));
// strip the automatically created body tags from the serialized output
$transformedHtmlString = str_replace('<body>', '', $transformedHtmlString);
$transformedHtmlString = str_replace('</body>', '', $transformedHtmlString);
echo $transformedHtmlString;
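The transformation above keeps the tags and discards the text. If the goal is instead the first "x" words that are not nested in any element, a DOM query may be simpler than changing the regex. The following is only a sketch under assumptions: the helper name firstWordsOutsideTags and the wrapper div with id "root" are hypothetical, "inside an HTML tag" is read as "nested in any element", and the input is assumed to be a plain-ASCII inline fragment (loadHTML() does not assume UTF-8 unless the document declares it).

```php
<?php
// Hypothetical helper: returns up to $limit words found in text that is
// NOT nested inside any element of the given HTML fragment.
function firstWordsOutsideTags(string $html, int $limit): array
{
    if ($limit <= 0) {
        return [];
    }

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Assumption: wrap the fragment in a known container so that
    // top-level text has a predictable parent element.
    $dom->loadHTML('<div id="root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $words = [];

    // Select only text nodes that are direct children of the wrapper;
    // text inside <a>, <u>, ... has a different parent and is skipped.
    foreach ($xpath->query('//div[@id="root"]/text()') as $textNode) {
        $parts = preg_split('/\s+/', $textNode->nodeValue, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($parts as $word) {
            $words[] = $word;
            if (count($words) === $limit) {
                return $words;
            }
        }
    }

    return $words;
}

$sample = 'Some text <a href="/"><u>Linktext</u></a> more text';
echo implode(' ', firstWordsOutsideTags($sample, 3)), PHP_EOL;
```

In this sketch "Linktext" is never counted because its text node belongs to the u element, not to the wrapper. A fragment that contains block-level markup or a stray closing div would break the wrapper assumption and would need a different query.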