Extract text from HTML
Actors: example world
I want to extract "example world" from this markup using a regular expression in PHP.
preg_match('/<strong class="nfpd">Actors<\/strong>:([^<]+)<br \/>/', $text, $matches);
print_r($matches);
As Gumbo already pointed out in the comments to this question, and as you have been told in a number of your previous questions, regex isn't the right tool for parsing HTML.
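To illustrate why, here is a minimal sketch that runs the pattern from the question against a few variations of the same markup. The variations are hypothetical and not taken from the original page; they only show how small, valid changes in the HTML can stop the pattern from matching.

```php
<?php
// The pattern from the question, unchanged.
$pattern = '/<strong class="nfpd">Actors<\/strong>:([^<]+)<br \/>/';

// Hypothetical variations of the same markup (assumptions for illustration).
$variants = [
    'original'      => '<strong class="nfpd">Actors</strong>: example world <br />',
    'html5 br'      => '<strong class="nfpd">Actors</strong>: example world <br>',
    'single quotes' => "<strong class='nfpd'>Actors</strong>: example world <br />",
    'extra attr'    => '<strong id="a" class="nfpd">Actors</strong>: example world <br />',
];

$results = [];
foreach ($variants as $label => $markup) {
    // preg_match returns 1 on a match and 0 when the pattern does not match.
    $results[$label] = preg_match($pattern, $markup) === 1;
    echo $label, ': ', $results[$label] ? 'match' : 'no match', PHP_EOL;
}
```

Only the first variant matches, even though a browser or a DOM parser would treat all four the same way.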
The following uses DOM to get the first following sibling text node of any strong element with a class attribute of nfpd. For the example HTML, this is the content of the text node, i.e. ": example world".
Example HTML:
$html = <<< HTML
<p>
<strong class="nfpd">Actors</strong>: example world <br />
something else
</p>
HTML;
And extraction with DOM:
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
libxml_clear_errors();
$nodes = $xPath->query('//strong[@class="nfpd"]/following-sibling::text()[1]');
foreach($nodes as $node) {
echo $node->nodeValue; // : example world
}
You can also do it without XPath, though it gets more verbose:
$nodes = $dom->getElementsByTagName('strong');
foreach($nodes as $node) {
if($node->hasAttribute('class') &&
$node->getAttribute('class') === 'nfpd' &&
$node->nextSibling) {
echo $node->nextSibling->nodeValue; // : example world
}
}
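Note that the loop above echoes whatever node comes next, whereas the XPath version only selects text nodes. If the markup might place an element directly after the strong tag, a node type check keeps the two approaches closer in behavior. The following is a sketch under that assumption; the second strong element and the $values array are hypothetical additions for illustration.

```php
<?php
// Hypothetical markup: the second <strong> is followed directly by an element.
$html = <<<HTML
<p>
<strong class="nfpd">Actors</strong>: example world <br />
<strong class="nfpd">Director</strong><em>not a text node</em>
</p>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$values = [];
foreach ($dom->getElementsByTagName('strong') as $node) {
    $next = $node->nextSibling;
    // getAttribute returns an empty string when the attribute is missing,
    // so a separate hasAttribute check is not needed here.
    if ($node->getAttribute('class') === 'nfpd'
        && $next !== null
        && $next->nodeType === XML_TEXT_NODE
    ) {
        $values[] = $next->nodeValue;
    }
}

// Only the text after the first <strong> is collected.
print_r($values);
```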
Removing the colon and whitespace is trivial: use trim.
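As a short sketch, trim accepts an optional second argument listing the characters to strip from both ends of the string. The $raw value below is assumed to be what the DOM extraction returns for the example HTML.

```php
<?php
// Value as extracted from the example HTML (assumed to include the
// leading colon and surrounding whitespace).
$raw = ": example world ";

// Strip colon, space, tab, newline and carriage return from both ends.
$clean = trim($raw, ": \t\n\r");

echo $clean; // example world
```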