
Match tags within tags with preg_match_all in PHP

I'm currently working on a way to parse an HTML document into a database. I'm not allowed to change any formatting in the HTML document. In the following example I need to find which tags have the class "Category", and then grab the text inside that tag, in this example "Example Text".

How do I get the code to match more than just the tags that are closed directly after they are opened?

$tags = "<p class=Category style='margin-left:0in;text-indent:0in'>"
      . "<a name='_Toc390163149'></a><a name='_Ref388370252'></a>"
      . "<a name='_Toc122858606'><span lang=EN-GB>3.<span style='font:7.0pt 'Times New Roman''>&nbsp;</span></span>"
      . "<span lang=EN-GB>Example Text</span></a></p>";

// Groups: 1 = opening tag, 2 = tag name, 3 = content, 4 = closing tag
preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $tags, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
    echo "matched: " . htmlspecialchars($val[0]) . "<br>";
    echo "part 1: " . htmlspecialchars($val[1]) . "<br>";
    echo "part 2: " . htmlspecialchars($val[2]) . "<br>";
    echo "part 3: " . htmlspecialchars($val[3]) . "<br>";
    echo "part 4: " . htmlspecialchars($val[4]) . "<br><br>";
}

Outputs:

matched: <a name="_Toc390163149"></a>
part 1: <a name="_Toc390163149">
part 2: a
part 3:
part 4: </a>

matched: <a name="_Ref388370252"></a>
part 1: <a name="_Ref388370252">
part 2: a
part 3:
part 4: </a>

matched: <span lang=EN-GB>When not to follow Rules</span>
part 1: <span lang=EN-GB>
part 2: span
part 3: When not to follow Rules
part 4: </span>

Any ideas?

Short answer: you can't parse complicated data formats such as HTML with regex, or at least you shouldn't.

Long answer: PHP provides a number of libraries for parsing HTML that are both far less effort and far less error-prone than a regex solution. The two of interest are SimpleXML (if you're parsing XHTML) and DOMDocument (if you're parsing markup that may or may not be XML). I'd be inclined to use the latter for HTML.

Once you've loaded the markup into a DOMDocument, you can use an XPath query to locate all the p.Category tags and iterate over them to get their child nodes and content.
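
A minimal sketch of that approach, using a trimmed copy of the markup from the question. The function name extractCategoryTitles is made up for illustration, and taking the last <span> inside each paragraph as the heading text is an assumption based on this one sample; real documents may need a different rule.

```php
<?php
// Hypothetical helper: returns the heading text of every <p class=Category>.
function extractCategoryTitles(string $html): array
{
    $doc = new DOMDocument();

    // Word-generated HTML is rarely valid; collect parser warnings
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $titles = [];

    // Attribute values are case-sensitive in XPath, so "Category" must match exactly.
    foreach ($xpath->query('//p[@class="Category"]') as $p) {
        // Assumption: the last <span> in document order holds the heading text.
        $spans = $p->getElementsByTagName('span');
        if ($spans->length > 0) {
            $titles[] = trim($spans->item($spans->length - 1)->textContent);
        }
    }

    return $titles;
}

// Trimmed version of the question's sample markup.
$html = "<p class=Category style='margin-left:0in;text-indent:0in'>"
      . "<a name='_Toc390163149'></a><a name='_Ref388370252'></a>"
      . "<a name='_Toc122858606'><span lang=EN-GB>3.<span>&nbsp;</span></span>"
      . "<span lang=EN-GB>Example Text</span></a></p>";

echo implode("\n", extractCategoryTitles($html)), "\n"; // prints "Example Text"
```

If you want the whole paragraph text, including the "3." numbering, read $p->textContent directly instead of picking a span. Because the parser builds a real tree, nested tags of the same name are not a problem the way they are for the non-greedy regex.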
