Extract text from HTML

Actors: example world

I want to extract the text "example world" from this using a regular expression in PHP.

preg_match('/<strong class="nfpd">Actors<\/strong>:([^<]+)<br \/>/', $text, $matches);

print_r($matches);

As Gumbo already pointed out in the comments to this question, and as you have been told in a number of your previous questions, regex isn't the right tool for parsing HTML.

The following uses DOM to get the first following-sibling text node of any strong element whose class attribute is nfpd. For the example HTML, this is the content of the text node, i.e. ": example world".

Example HTML:

$html = <<< HTML
<p>
    <strong class="nfpd">Actors</strong>: example world <br />
    something else
</p>
HTML;

And extraction with DOM:

libxml_use_internal_errors(TRUE);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
libxml_clear_errors();

$nodes = $xPath->query('//strong[@class="nfpd"]/following-sibling::text()[1]');
foreach($nodes as $node) {
    echo $node->nodeValue; // : example world 
}

You can also do it without XPath, though it gets more verbose:

$nodes = $dom->getElementsByTagName('strong');
foreach($nodes as $node) {
    if($node->hasAttribute('class') &&
       $node->getAttribute('class') === 'nfpd' &&
       $node->nextSibling) {
        echo $node->nextSibling->nodeValue; // : example world 
    }
}

Removing the colon and whitespace is trivial: use trim.
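To make the trim step concrete, here is a minimal sketch that combines the XPath lookup with trim and a custom character list. The function name extractNfpdText is hypothetical (it does not appear in the answer above), and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the trimmed
// text that follows the first <strong class="nfpd"> element, or null if none.
function extractNfpdText(string $html): ?string
{
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xPath = new DOMXPath($dom);
    $nodes = $xPath->query('//strong[@class="nfpd"]/following-sibling::text()[1]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // The second argument lists the characters to strip from both ends:
    // the colon plus common whitespace.
    return trim($nodes->item(0)->nodeValue, ": \t\n\r");
}

$html = '<p><strong class="nfpd">Actors</strong>: example world <br />something else</p>';
echo extractNfpdText($html); // example world
```

Passing the character list to trim removes the leading colon in the same call; a plain trim($value) would strip only the whitespace and leave the colon in place.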
