From a string that contains a lot of HTML, how can I extract all the text from <h1>, <h2>, etc. tags into a new variable?
I would like to capture all of the text from these elements and store it in a new variable as comma-delimited values.
Is it possible using preg_match_all()?
First you need to clean up the HTML ($html_str in the example) with tidy:
$tidy_config = array(
"indent" => true,
"output-xml" => true,
"output-xhtml" => false,
"drop-empty-paras" => false,
"hide-comments" => true,
"numeric-entities" => true,
"doctype" => "omit",
"char-encoding" => "utf8",
"repeated-attributes" => "keep-last"
);
$xml_str = tidy_repair_string($html_str, $tidy_config);
Then you can load the XML ($xml_str) into a DOMDocument:
// loadXML() is an instance method; calling it statically fails on PHP 8.
$doc = new DOMDocument();
$doc->loadXML($xml_str);
And finally you can use Horia Dragomir's method:
$list = $doc->getElementsByTagName("h1");
for ($i = 0; $i < $list->length; $i++) {
print($list->item($i)->nodeValue . "<br/>\n");
}
Or you could use XPath for more complex queries on the DOMDocument (see http://www.php.net/manual/en/class.domxpath.php):
$xpath = new DOMXPath($doc);
$list = $xpath->evaluate("//h1");
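To get the comma-delimited variable the question asks for, the XPath approach can be extended to cover all six heading levels. This is only a minimal sketch: the sample $html_str and the $headings_csv name are made up for illustration, and it loads the markup with loadHTML() instead of the tidy step, so it runs without the tidy extension.

```php
<?php
// Hypothetical input; any HTML string would do.
$html_str = '<div><h1>Main title</h1><p>Text</p><h2 class="sub">Sub <em>title</em></h2></div>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for imperfect markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html_str);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Union of all six heading levels.
$nodes = $xpath->query('//h1 | //h2 | //h3 | //h4 | //h5 | //h6');

$titles = array();
foreach ($nodes as $node) {
    // textContent flattens nested inline tags such as <em>.
    $titles[] = trim($node->textContent);
}

// Join the heading texts into one comma-delimited string.
$headings_csv = implode(',', $titles);
echo $headings_csv, "\n";
```

Note that a heading which itself contains a comma would need quoting or a different separator.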
When the question is "How do I extract stuff from HTML", the answer is NEVER to use regular expressions. Instead, see the discussion on "Robust, Mature HTML Parser for PHP".
I know this is a super old post, but I wanted to mention the best way I have found to grab heading tags collectively.
<h1>title</h1> and <h2>title 2</h2>
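This pattern works as a generic regex, but PHP treats / as the pattern delimiter, so the unescaped / in the closing tag breaks it:
/<\s*h[1-2](?:.*)>(.*)</\s*h/i
In preg_match, use | as the delimiter instead (the U modifier makes the quantifiers ungreedy):
|<\s*h[1-2](?:.*)>(.*)</\s*h|Ui
$group[1] will contain whatever is between the heading tags. $group[0] is the whole match, e.g. <h1>test</h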
This accounts for spaces, and if someone adds a class or id, as in
<h1 class="classname">test</h1>
the attribute is ignored, because it falls into the non-capturing group.
NOTE: When I analyze HTML tags, I always replace all whitespace (line breaks, tabs, etc.) with a single space. This reduces the need for multi-line and dotall modifiers, and it removes the very large runs of whitespace that can in some cases break a regex.
Here is a link to the test page regex test
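The whitespace normalization from the note can be sketched like this, followed by the |-delimited pattern from this answer. The sample $html is hypothetical. Without the normalization step the pattern would miss the first heading, since . does not match line breaks unless the s modifier is set.

```php
<?php
// Hypothetical multi-line input.
$html = "<h1\n   class=\"big\">\n\tFirst   title\n</h1>\n\n<h2>Second\ttitle</h2>";

// Collapse every run of whitespace (spaces, tabs, line breaks) into one space.
$normalized = preg_replace('/\s+/', ' ', $html);

// Same pattern as in the answer: | as delimiter, U for ungreedy, i for case-insensitive.
preg_match_all('|<\s*h[1-2](?:.*)>(.*)</\s*h|Ui', $normalized, $group);

// $group[1] holds the text between the tags; trim the leftover single spaces.
$titles = array_map('trim', $group[1]);
print_r($titles);
```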
You're probably better off using an HTML parser. But for really simple scenarios, something like this might do:
if (preg_match_all('/<h\d>([^<]*)<\/h\d>/iU', $str, $matches)) {
// $matches contains all instances of h1-h6
}
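If the headings may carry attributes, a slightly stricter variant could look like the sketch below. It is still only suitable for simple, non-nested markup; the sample $str and the $headings variable are assumptions for illustration. It restricts the level to 1-6, tolerates attributes, and joins the captured texts with commas.

```php
<?php
// Hypothetical input.
$str = '<h1>Intro</h1><p>Body</p><h2 id="s1">Setup</h2><h3>Notes</h3>';

$headings = '';
// ([1-6]) captures the level, [^>]* tolerates attributes,
// and \1 requires the closing tag to have the same level as the opening tag.
if (preg_match_all('/<h([1-6])\b[^>]*>([^<]*)<\/h\1>/i', $str, $matches)) {
    // $matches[2] holds the captured heading texts.
    $headings = implode(',', array_map('trim', $matches[2]));
}
echo $headings, "\n";
```

Because the capture group is [^<]*, a heading that contains inline tags such as <em> will not match at all; that is the point where a parser becomes the safer choice.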
It is recommended not to use regex for this job; use something like the SimpleHTMLDOM parser instead.
If you actually want to use regular expressions, I think that:
preg_match_all('/<h[1-6]>([^<]*)<\/h[1-6]>/i', $string, $matches);
should work as long as your header tags are not nested. As others have said, if you're not in control of the HTML, regular expressions are not a great way to do this.
Please also consider the native DOMDocument PHP class. You can use $domdoc->getElementsByTagName('h1') to get your headings.
I just want to share my solution:
function get_all_headings( $content ) {
// Lazy (.*?) stops at the first closing tag; \1 requires the same level as the opening tag.
preg_match_all( '/<(h[1-6])>(.*?)<\/\1>/is', $content, $matches );
$r = array();
if( !empty( $matches[1] ) && !empty( $matches[2] ) ){
$tags = $matches[1];
$titles = $matches[2];
foreach ($tags as $i => $tag) {
$r[] = array( 'tag' => $tag, 'title' => $titles[ $i ] );
}
}
return $r;
}
This function will return an empty array if headings were not found or something like this:
array (
array (
'tag' => 'h1',
'title' => 'This is a title',
),
array (
'tag' => 'h2',
'title' => 'This is the second title',
),
)
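To turn a result of that shape into the comma-delimited string from the question, something like the following should work. The sample array just mirrors the output shown above, and $titles_csv, $h2_only and $h2_csv are names made up for this sketch; array_column() requires PHP 5.5 or later.

```php
<?php
// Sample data shaped like the function's return value.
$headings = array(
    array('tag' => 'h1', 'title' => 'This is a title'),
    array('tag' => 'h2', 'title' => 'This is the second title'),
);

// Pull out just the 'title' column and join it with commas.
$titles_csv = implode(',', array_column($headings, 'title'));
echo $titles_csv, "\n";

// Optionally keep a single heading level before joining.
$h2_only = array_filter($headings, function ($h) {
    return $h['tag'] === 'h2';
});
$h2_csv = implode(',', array_column($h2_only, 'title'));
echo $h2_csv, "\n";
```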
This is an old question, but since there are no newer answers, I wrote this with PHP's built-in DOM parser.
$dom = new DOMDocument();
$dom -> loadHTML("your html string here..");
$h2s = $dom -> getElementsByTagName('h2');
foreach ( $h2s as $h2 )
{
echo $h2 -> nodeValue;
}
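The same DOM approach can be stretched to cover every heading level in one pass. The collect_headings() helper below is hypothetical, not part of PHP; it is a rough sketch that walks all elements and keeps those whose tag name is h1 to h6.

```php
<?php
// Hypothetical helper: returns every h1-h6 found in $html with its tag name.
function collect_headings($html)
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally; real-world HTML is rarely valid.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $found = array();
    // '*' matches every element in the document.
    foreach ($dom->getElementsByTagName('*') as $el) {
        if (preg_match('/^h[1-6]$/i', $el->nodeName)) {
            $found[] = array(
                'tag'   => strtolower($el->nodeName),
                'title' => trim($el->nodeValue),
            );
        }
    }
    return $found;
}

$result = collect_headings('<h2>Alpha</h2><p>x</p><h3>Beta</h3><h2>Gamma</h2>');
print_r($result);
```

Only the element name is tested with a regex here; the parsing itself is left to DOMDocument.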