Extracting specific text from HTML texts
I am not very familiar with regex. I am trying to obtain the results described at the bottom. Here is what I have done so far (note that $page contains tab characters):
$page = "<div class=\"title-container\">
<h1>Text here<span> /Sub-text/</span> </h1>
</div>";
// TITLE
preg_match_all ('/<h1>(.*)<\/h1>/U', $page, $out);
$hutitle = preg_replace("#<span>(.*)<\/span>\s#", "", $out[1][0]);
$entitle = preg_replace("'(.*)<span> /'", "", $out[1][0]);
I would like to get this:
$hutitle = "Text here";
$entitle = "Sub-text"; (Without html and "/")
Try this:
<h1>(.*?)<span> /(.*?)/</span>
The capture groups $1 and $2 hold the results you expect.
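As a minimal sketch of how that pattern could be applied in PHP with preg_match: the "~" delimiter, the $m variable, and the exact whitespace in the sample $page are my own assumptions, not something from the question.

```php
<?php
// Sample input modeled on the question; the exact whitespace is an assumption.
$page = "<div class=\"title-container\">\n\t<h1>Text here<span> /Sub-text/</span> </h1>\n</div>";

$hutitle = null;
$entitle = null;

// "~" is used as the delimiter so the "/" characters need no escaping.
if (preg_match('~<h1>(.*?)<span> /(.*?)/</span>~', $page, $m)) {
    $hutitle = trim($m[1]); // text between <h1> and <span>
    $entitle = trim($m[2]); // text between the two slashes
}

echo $hutitle, "\n", $entitle, "\n";
```

Note that the pattern expects exactly one space between <span> and the first slash. If the real markup varies, replacing that space with \s* would probably be safer.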
I'd suggest using DOM with trim; there is no need for regex. Here is working code for your concrete case:
$page = "<div class=\"title-container\">\n <h1>Text here<span> /Sub-text/</span> </h1>\n </div>";
$dom = new DOMDocument;
$dom->loadHTML($page);
$hs = $dom->getElementsByTagName('h1');
foreach ($hs as $h) {
    // Look for a <span> inside the current <h1>
    $entitlenodes = $h->getElementsByTagName('span');
    if ($entitlenodes->length > 0) {
        $span = $entitlenodes->item(0);
        // Strip surrounding spaces and slashes: " /Sub-text/" becomes "Sub-text"
        $entitle = trim($span->nodeValue, " /");
        echo $entitle . "\n";
        // Remove the span so only the main title text remains in the <h1>
        $span->parentNode->removeChild($span);
    }
    // trim() drops the space left in front of </h1>
    $hutitle = trim($h->nodeValue);
    echo $hutitle;
}
See the IDEONE demo.