
What's the easiest way to extract a piece of data from HTML in PHP?

I'm working with a small subset of mostly invalid HTML, and I need to extract a small piece of data. Given that most of the "markup" isn't valid, I don't think that loading everything into a DOM is a good option. Moreover, it seems like a lot of overhead for this simple case.

Here's an example of the markup that I have:

(a bunch of invalid markup here with unclosed tags, etc.)
<TD><span>Something (random text here)</span></TD>
(a bunch more invalid markup here with more unclosed tags.)

The <TD><span>Something (random text here)</span></TD> portion does not repeat itself anywhere in the document, so I believe a simple regex would do the trick.

However, I'm terrible with regular expressions.

Should I use a regular expression? Is there a simpler way to do this? If possible, I'd just like to extract the text after Something, that is, the (random text here) portion.

Thanks in advance!

Edit:

Exact example of the HTML (I've omitted the stuff prior, which is the invalid markup that the vendor uses. It's irrelevant for this example, I believe):

<div class="FormTable">
        <TABLE>
        <TR>
                <TD colspan="2">In order to proceed with login operation please 
                answer on the security question below</TD>
        </TR>
        <TR>
                <TD colspan="2">&nbsp;</TD>
        </TR>
        <TR>
                <TD><label class="FormLabel">Security Question</label></TD>
                <TD><span>What is your city of birth?</span></TD>
        </TR>
        <TR>
                <TD><label class="FormLabel">Answer</label></TD>
                <TD><INPUT name="securityAnswer" class="input" type="password" value=""></TD>
        </TR>
        </TABLE>
</div>  

If you're sure the opening and closing span tags are on a single line...

$ cat test.php
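<?php
  // Sample input: the one well-formed fragment surrounded by junk markup.
  $subject = "(a bunch of invalid markup here with unclosed tags, etc.)
              <TD><span>Something (random text here)</span></TD>
              (a bunch more invalid markup here with more unclosed tags.)";

  // "." does not match newlines by default, so this only matches a span
  // that opens and closes on one line. The lazy ".*?" stops at the first
  // closing tag instead of running on to the last one on that line.
  $pattern = '/<span>.*?<\/span>/';

  preg_match($pattern, $subject, $matches);
  print_r($matches);

?>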


$ php -f test.php
Array
(
    [0] => <span>Something (random text here)</span>
)
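
To pull out only the text after Something, as the question asks, a capture group can be added. This is a minimal sketch under the same single-line assumption; the variable name $randomText and the case-insensitive flag are illustrative choices, not part of the original answer.

```php
<?php
$subject = "(a bunch of invalid markup here with unclosed tags, etc.)
            <TD><span>Something (random text here)</span></TD>
            (a bunch more invalid markup here with more unclosed tags.)";

// Group 1 captures whatever follows the literal word "Something"
// (and any whitespace after it) up to the first closing span tag.
$pattern = '/<span>Something\s*(.*?)<\/span>/i';

$randomText = null;
if (preg_match($pattern, $subject, $matches)) {
    $randomText = $matches[1];
    echo $randomText, "\n";
}
```

If preg_match() finds nothing, $randomText stays null, so the caller can tell "no match" apart from an empty capture.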

If you're not confident that the span tags are on the same line, you can treat the HTML as a text file and grep for the span tags.

$ grep '[</]span>' yourfile.html
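
If you would rather stay in PHP when the span may wrap across lines, the s modifier lets "." match newlines as well. The following is a sketch only; the sample string and the whitespace clean-up step are assumptions added for illustration.

```php
<?php
// Assumed sample input: the span content is broken across several lines.
$subject = "<TD><span>Something\n (random text here)\n</span></TD>";

$text = null;
// s: "." also matches newlines; ".*?" is lazy, so it stops at the
// first closing span tag.
if (preg_match('/<span>(.*?)<\/span>/s', $subject, $m)) {
    // Collapse runs of whitespace (including the line breaks) to one space.
    $text = trim(preg_replace('/\s+/', ' ', $m[1]));
    echo $text, "\n";
}
```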

You might read through this answer and the other two it cites. Tackling invalid HTML a bit at a time is something you're apt to have better luck doing with regexes than with full parsers.

A DOM parser is not optimal in your situation. I strongly believe that you need a SAX parser: it extracts parts of your document as it reads and sends the corresponding events to your handlers. This approach makes it easy to parse broken documents.

Examples: http://pear.php.net/package/XML_HTMLSax3 and http://www.php.net/manual/en/example.xml-structure.php
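
To illustrate the event-driven idea without any extra package, here is a rough sketch. It is not the XML_HTMLSax3 API and not a real SAX parser: scanMarkup() is a hypothetical helper that walks the text once and reports open tags, close tags, and text runs to a callback, never building a tree. The sample rows are taken from the question.

```php
<?php
// Hypothetical helper, for illustration only: a tiny event-driven scanner.
function scanMarkup(string $html, callable $onEvent): void
{
    // First alternative matches a tag, second matches a run of text.
    $pattern = '/<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)/';
    preg_match_all($pattern, $html, $tokens, PREG_SET_ORDER);

    foreach ($tokens as $t) {
        if (($t[3] ?? '') !== '') {
            $onEvent('text', $t[3]);
        } else {
            $onEvent($t[1] === '/' ? 'close' : 'open', strtolower($t[2]));
        }
    }
}

$html = "<TD><label class=\"FormLabel\">Security Question</label></TD>\n"
      . "<TD><span>What is your city of birth?</span></TD>";

$inSpan = false;
$found = [];

// The handler keeps a one-bit state: are we currently inside a span?
scanMarkup($html, function (string $type, string $value) use (&$inSpan, &$found): void {
    if ($type === 'open' && $value === 'span') {
        $inSpan = true;
    } elseif ($type === 'close' && $value === 'span') {
        $inSpan = false;
    } elseif ($type === 'text' && $inSpan) {
        $found[] = trim($value);
    }
});

echo $found[0] ?? '', "\n";
```

Because the handler only reacts to span events, unclosed or unknown tags elsewhere in the input are reported and ignored rather than causing a failure.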

Try using the DOMDocument::loadHTML() method, which should suppress any validation errors related to the HTML.
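
A minimal sketch of that suggestion, assuming the DOM and libxml extensions are enabled. The call to libxml_use_internal_errors() and the XPath expression are additions chosen for this example (the answer itself only names loadHTML()), and the markup is a trimmed copy of the snippet from the question.

```php
<?php
$html = <<<'HTML'
<div class="FormTable">
        <TABLE>
        <TR>
                <TD><label class="FormLabel">Security Question</label></TD>
                <TD><span>What is your city of birth?</span></TD>
        </TR>
        <TR>
                <TD><label class="FormLabel">Answer</label></TD>
                <TD><INPUT name="securityAnswer" class="input" type="password" value=""></TD>
        </TR>
        </TABLE>
</div>
HTML;

// Collect parser warnings internally instead of printing them; invalid
// vendor markup is likely to trigger many.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// The HTML parser lowercases element names, so query "td" rather than "TD".
$nodes = $xpath->query(
    '//td[label[normalize-space()="Security Question"]]/following-sibling::td[1]/span'
);

$question = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
echo $question, "\n";
```

Anchoring on the label text rather than on the span's position keeps the lookup stable if the vendor adds or reorders rows, at the cost of depending on that exact label wording.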
