Extract specific text from HTML text

Question

I am not very familiar with regex. I am trying to obtain the results described at the bottom. Here is what I have done so far (note that $page contains tab characters):

$page = "<div class=\"title-container\">
                            <h1>Text here<span> /Sub-text/</span> </h1>
                                                     </div>";
// TITLE
preg_match_all ('/<h1>(.*)<\/h1>/U', $page, $out);
$hutitle = preg_replace("#<span>(.*)<\/span>\s#", "", $out[1][0]);

$entitle = preg_replace("'(.*)<span> /'", "", $out[1][0]);

I would like to get this:

$hutitle = "Text here"; 
$entitle = "Sub-text"; // without HTML and "/"

Answer 1

Try this:

<h1>(.*?)<span> /(.*?)/</span>

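Capture groups $1 and $2 hold the results you expect. In PHP, the pattern needs a delimiter other than /, such as #, or else the slashes inside it must be escaped.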
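As a minimal sketch of how this pattern could be applied in PHP: the snippet below assumes the sample $page from the question, uses # as the delimiter, and adds the s modifier and trim() calls as precautions (neither is part of the original answer). The array name $m is just an illustrative choice.

```php
<?php
// Sample input modelled on the question (newline and tab inside the markup).
$page = "<div class=\"title-container\">\n\t<h1>Text here<span> /Sub-text/</span> </h1>\n</div>";

// '#' is the delimiter, so the slashes in the pattern need no escaping.
// The 's' modifier lets '.' match newlines in case the title spans lines.
if (preg_match('#<h1>(.*?)<span> /(.*?)/</span>#s', $page, $m)) {
    $hutitle = trim($m[1]); // first capture group: text before the <span>
    $entitle = trim($m[2]); // second capture group: text between the slashes
    echo $hutitle, "\n", $entitle, "\n";
}
```

With preg_match the groups arrive as $m[1] and $m[2]; the $1 and $2 notation applies to replacement strings in preg_replace.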

Answer 2

I'd suggest using DOM with trim; no regex is needed. Here is working code for your concrete case:

$page = "<div class=\"title-container\">\n                            <h1>Text here<span> /Sub-text/</span> </h1>\n                                                     </div>";

$dom = new DOMDocument;
$dom->loadHTML($page);

foreach ($dom->getElementsByTagName('h1') as $h) {
    $spans = $h->getElementsByTagName('span');
    if ($spans->length > 0) {
        $span = $spans->item(0);
        // Strip the surrounding spaces and slashes from the sub-title.
        $entitle = trim($span->nodeValue, " /");
        echo $entitle . "\n";
        // Remove the span so that only the main title text remains in the h1.
        $span->parentNode->removeChild($span);
    }
    // Trim the whitespace left behind where the span used to be.
    $hutitle = trim($h->nodeValue);
    echo $hutitle;
}

See IDEONE demo

2 answers

Answer 1
1 2015-07-06 09:29:15

Answer 2
1 Accepted 2015-07-06 09:40:04
