Extracting text from HTML

Question

Actors: example world

How can I extract the text "example world" from this using a regular expression in PHP?

Answer 1

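// Capture everything between the "Actors" label and the following <br />.
preg_match('/<strong class="nfpd">Actors<\/strong>:([^<]+)<br \/>/', $text, $matches);

// $matches[0] is the full match; $matches[1] is the captured text (with surrounding whitespace).
print_r($matches);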

Answer 2

As Gumbo already pointed out in the comments to this question, and as you have been told in a number of your previous questions, regex isn't the right tool for parsing HTML.

The following uses DOM to get the first following text-node sibling of any strong element with a class attribute of nfpd. In the case of the example HTML, this is the content of the text node that comes right after the element, i.e. ": example world ".

Example HTML:

$html = <<< HTML
<p>
    <strong class="nfpd">Actors</strong>: example world <br />
    something else
</p>
HTML;

And extraction with DOM:

libxml_use_internal_errors(TRUE);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
libxml_clear_errors();

$nodes = $xPath->query('//strong[@class="nfpd"]/following-sibling::text()[1]');
foreach($nodes as $node) {
    echo $node->nodeValue; // : example world 
}

You can also do it without XPath, though it gets more verbose:

$nodes = $dom->getElementsByTagName('strong');
foreach($nodes as $node) {
    if($node->hasAttribute('class') &&
       $node->getAttribute('class') === 'nfpd' &&
       $node->nextSibling) {
        echo $node->nextSibling->nodeValue; // : example world 
    }
}

Removing the colon and whitespace is trivial: use trim.
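A minimal sketch of that cleanup, assuming the node value is the string ": example world " that the example HTML yields. The variable names $raw and $clean are illustrative only and do not appear in the answer. Passing a character list as the second argument makes trim strip the colon along with the whitespace:

```php
<?php
// Assumed input: the text node value from the example HTML
// (a leading colon plus surrounding whitespace).
$raw = ": example world ";

// The second argument lists the characters to strip from both ends:
// the colon plus trim's usual whitespace set (space, tab, newline,
// carriage return, NUL byte, vertical tab).
$clean = trim($raw, ": \t\n\r\0\x0B");

echo $clean; // example world
```

Listing the whitespace characters explicitly matters here, because supplying a custom character list replaces trim's default set rather than adding to it.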
