What's the easiest way to extract a piece of data from HTML in PHP?

Question

I'm working with a small subset of mostly invalid HTML, and I need to extract a small piece of data. Given that most of the markup isn't valid, I don't think loading everything into a DOM is a good option. Moreover, it seems like a lot of overhead for this simple case.

Here's an example of the markup that I have:

(a bunch of invalid markup here with unclosed tags, etc.)
<TD><span>Something (random text here)</span></TD>
(a bunch more invalid markup here with more unclosed tags.)

The <TD><span>Something (random text here)</span></TD> portion does not repeat itself anywhere in the document, so I believe a simple regex would do the trick.

However, I'm terrible with regular expressions.

Should I use a regular expression? Is there a simpler way to do this? If possible, I'd just like to extract the text after Something, i.e. the (random text here) portion.

Thanks in advance!

Edit -

Exact example of the HTML (I've omitted the stuff prior, which is the invalid markup that the vendor uses. It's irrelevant for this example, I believe):

<div class="FormTable">
        <TABLE>
        <TR>
                <TD colspan="2">In order to proceed with login operation please 
                answer on the security question below</TD>
        </TR>
        <TR>
                <TD colspan="2">&nbsp;</TD>
        </TR>
        <TR>
                <TD><label class="FormLabel">Security Question</label></TD>
                <TD><span>What is your city of birth?</span></TD>
        </TR>
        <TR>
                <TD><label class="FormLabel">Answer</label></TD>
                <TD><INPUT name="securityAnswer" class="input" type="password" value=""></TD>
        </TR>
        </TABLE>
</div>

Answer 1

If you're sure the opening and closing span tags are on a single line . . .

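$ cat test.php
<?php
  // Sample input: the target span sits on one line amid invalid markup.
  $subject = "(a bunch of invalid markup here with unclosed tags, etc.)
              <TD><span>Something (random text here)</span></TD>
              (a bunch more invalid markup here with more unclosed tags.)";

  // Lazy .*? stops at the first </span>; the group captures the text inside.
  // Without the s modifier, . does not match newlines.
  $pattern = '/<span>(.*?)<\/span>/';

  preg_match($pattern, $subject, $matches);
  print_r($matches);

?>

$ php -f test.php
Array
(
    [0] => <span>Something (random text here)</span>
    [1] => Something (random text here)
)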

If you're not confident that the span tags are on the same line, you can treat the HTML as a text file and grep for the span tags. This prints every line that contains <span> or </span>.

$ grep '[</]span>' yourfile.html
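
If the span may wrap across lines and you'd rather stay in PHP, here is a minimal sketch. It assumes the placeholder text from the question (the literal word Something followed by a parenthesized part); the variable names are only for illustration.

```php
<?php
// Placeholder input from the question, with the span wrapped across two lines.
$subject = "(invalid markup)\n<TD><span>Something\n  (random text here)</span></TD>\n(more invalid markup)";

// s modifier: . also matches newlines. The lazy .*? stops at the first ")".
// The capture group holds only the parenthesized part after "Something".
$pattern = '/<span>\s*Something\s*(\(.*?\))\s*<\/span>/s';

// preg_match returns 1 on a match, 0 on no match, false on error.
$captured = preg_match($pattern, $subject, $matches) === 1 ? $matches[1] : null;

var_dump($captured);
```

If the real text does not start with a fixed word, drop Something from the pattern and capture everything between the span tags instead.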

Answer 2

You might read through this answer and the other two it cites. Working through invalid HTML a piece at a time is actually a case where regexes are apt to serve you better than a full parser.
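
As a rough sketch of that piecemeal approach, applied to the exact markup from the question's edit: the pattern anchors on the "Security Question" label so that spans elsewhere in the vendor's markup are ignored. It assumes the label and span keep the layout shown in the question; the variable names are made up for the example.

```php
<?php
// Fragment of the exact markup from the question's edit.
$html = <<<'HTML'
        <TR>
                <TD><label class="FormLabel">Security Question</label></TD>
                <TD><span>What is your city of birth?</span></TD>
        </TR>
HTML;

// i: case-insensitive tag names; s: . matches newlines.
// Using ~ as the delimiter avoids escaping the "/" in closing tags.
$pattern = '~Security Question</label>\s*</TD>\s*<TD>\s*<span>(.*?)</span>~is';

$question = null;
if (preg_match($pattern, $html, $m) === 1) {
    // Decode entities such as &amp; and strip surrounding whitespace.
    $question = trim(html_entity_decode($m[1]));
}

echo $question, "\n";
```

If the vendor adds attributes to the TD or span tags, the pattern would need loosening (for example <span[^>]*>).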

Answer 3

A DOM parser is not the best fit for your situation. I strongly believe you need a SAX parser: it walks the document piece by piece and sends the corresponding events to your handlers. This approach makes it easy to parse broken documents.

Examples: http://pear.php.net/package/XML_HTMLSax3 http://www.php.net/manual/en/example.xml-structure.php

Answer 4

Try using the DOMDocument::loadHTML() method, which should get you past any validation errors related to the HTML.
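
A minimal sketch of this approach, assuming the markup from the question's edit. Note that loadHTML() still reports warnings for invalid markup unless they are suppressed, so the sketch collects them with libxml_use_internal_errors(). The XPath expression and variable names are illustrative assumptions, not something the vendor markup guarantees.

```php
<?php
// Trimmed copy of the markup from the question's edit.
$html = <<<'HTML'
<div class="FormTable">
<TABLE>
<TR>
<TD><label class="FormLabel">Security Question</label></TD>
<TD><span>What is your city of birth?</span></TD>
</TR>
<TR>
<TD><label class="FormLabel">Answer</label></TD>
<TD><INPUT name="securityAnswer" class="input" type="password" value=""></TD>
</TR>
</TABLE>
</div>
HTML;

// Collect parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Find the label by its text, step up to its cell, then take the span
// in the next sibling cell. Using * avoids depending on tag-name case.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query(
    '//label[normalize-space(.)="Security Question"]/parent::*/following-sibling::*[1]/span'
);

$question = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

echo $question, "\n";
```

Whether this holds up on the full vendor page depends on how libxml repairs the unclosed tags that come before the table, so it is worth testing against a real response.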

4 answers

solution1
2 ACCEPTED 2011-02-08 15:09:31

solution2
1 2011-02-08 15:02:50

solution3
1 2011-02-08 17:38:32

solution4
0 2011-02-08 15:05:22
