Extracting specific text from HTML texts

I am not very familiar with regex. I am trying to obtain the results described at the bottom. Here is what I have done so far (note that $page contains tab characters):

$page = "<div class=\"title-container\">
                            <h1>Text here<span> /Sub-text/</span> </h1>
                                                     </div>";
// TITLE
preg_match_all ('/<h1>(.*)<\/h1>/U', $page, $out);
$hutitle = preg_replace("#<span>(.*)<\/span>\s#", "", $out[1][0]);

$entitle = preg_replace("'(.*)<span> /'", "", $out[1][0]);

I would like to get this:

$hutitle = "Text here"; 
$entitle = "Sub-text"; // without the HTML tags and the "/" characters

Try this:

<h1>(.*?)<span> /(.*?)/</span>

$1 and $2 hold the results you expect: $1 is the text before the span and $2 is the text between the slashes.
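For what it's worth, here is a minimal sketch of how that pattern might be applied in PHP with preg_match, where the two groups end up in $m[1] and $m[2] rather than $1 and $2. The `#` delimiter, the `s` modifier, the extra `\s*` and the trim() calls are my own additions to tolerate the whitespace in $page; they are not part of the pattern as posted.

```php
<?php
// Sample input modelled on the question (tabs and newlines inside the markup).
$page = "<div class=\"title-container\">\n\t<h1>Text here<span> /Sub-text/</span> </h1>\n</div>";

// '#' is the delimiter, so the literal '/' characters need no escaping.
// Group 1: text between <h1> and <span>; group 2: text between the two slashes.
// The 's' modifier lets '.' match newlines in case the heading spans lines.
$pattern = '#<h1>(.*?)<span>\s*/(.*?)/\s*</span>#s';

$hutitle = null;
$entitle = null;
if (preg_match($pattern, $page, $m) === 1) {
    $hutitle = trim($m[1]);
    $entitle = trim($m[2]);
}

echo $hutitle, "\n", $entitle, "\n";
```

If the pattern does not match, both variables stay null, so it is worth checking for that before using them.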

I'd suggest using DOM with trim; there is no need for regex. Here is working code for your concrete case:

$page = "<div class=\"title-container\">\n                            <h1>Text here<span> /Sub-text/</span> </h1>\n                                                     </div>";

$dom = new DOMDocument;
$dom->loadHTML($page);
$hs = $dom->getElementsByTagName('h1');
foreach ($hs as $h) {
    $enttitlenodes = $h->getElementsByTagName('span');
    if ($enttitlenodes->length > 0 && $enttitlenodes->item(0)->tagName == 'span')
    {
        $entitle = trim($enttitlenodes->item(0)->nodeValue, " /");
        echo $entitle . "\n";
        $h->removeChild($enttitlenodes->item(0)); 
    }
    // Trim the whitespace that remains after the removed span.
    $hutitle = trim($h->nodeValue);
    echo $hutitle;
}

See the IDEONE demo.
