Getting innertext of HTML tags using Regular Expressions

Question

I'm having trouble capturing this data:

              <tr>
                <td><span class="bodytext"><b>Contact:</b><b></b></span><span style='font-size:10.0pt;font-family:Verdana;
  mso-bidi-font-family:Arial'><b> </b> 
                      <span class="bodytext">John Doe</span> 
                     </span></td>
              </tr>
              <tr>
                <td><span class="bodytext">PO Box 2112</span></td>
              </tr>
              <tr>
                <td><span class="bodytext"></span></td>
              </tr>

              <!--*********************************************************


              -->
              <tr>
                <td><span class="bodytext"></span></td>
              </tr>



              <tr>
                <td><span class="bodytext">JOHAN</span> NSW 9700</td>
              </tr>
              <tr>
                <td><strong>Phone:</strong> 
                02 9999 9999
                    </td>
              </tr>

Basically, I want to grab everything after "Contact:" and before "Phone:", minus the HTML. However, these two labels may not always exist, so I really need to grab everything between the two colons (:) that isn't located inside an HTML tag. The number of <span class="bodytext">***data***</span> elements may vary, so I need some sort of loop for matching these.

I'd prefer to use regular expressions, although I could probably do this using loops and string matching.

Also, I'd like to know the syntax for non-matching groups in PHP regex.

Any help would be greatly appreciated!

Answer 1

If I understand you correctly, you're only interested in the text between the HTML tags. To ignore the HTML tags, simply strip them first:

$text = preg_replace('/<[^<>]+>/', '', $html);

To grab everything between "Contact:" and "Phone:", use:

if (preg_match('/Contact:(.*?)Phone:/s', $text, $regs)) {
  $result = $regs[1];
} else {
  $result = "";
}

To grab everything between two colons, use:

if (preg_match('/:([^:]*):/', $text, $regs)) {
  $result = $regs[1];
} else {
  $result = "";
}
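As a rough sketch of how these steps fit together, the snippet below strips the tags, captures the text between the two labels, and then loops over the captured lines to drop the whitespace-only ones. The sample markup is a shortened, hypothetical version of the HTML in the question, and the $lines array is just an illustrative name.

```php
<?php
// Hypothetical sample input modeled on the markup in the question.
$html = <<<'HTML'
<tr>
  <td><span class="bodytext"><b>Contact:</b></span>
      <span class="bodytext">John Doe</span></td>
</tr>
<tr>
  <td><span class="bodytext">PO Box 2112</span></td>
</tr>
<tr>
  <td><span class="bodytext">JOHAN</span> NSW 9700</td>
</tr>
<tr>
  <td><strong>Phone:</strong> 02 9999 9999</td>
</tr>
HTML;

// Step 1: remove the tags.
$text = preg_replace('/<[^<>]+>/', '', $html);

// Step 2: capture what lies between the two labels.
$lines = [];
if (preg_match('/Contact:(.*?)Phone:/s', $text, $regs)) {
    // Step 3: split on line breaks and keep only non-empty, trimmed pieces.
    foreach (preg_split('/\R/', $regs[1]) as $line) {
        $line = trim($line);
        if ($line !== '') {
            $lines[] = $line;
        }
    }
}

print_r($lines);
```

For this input, $lines ends up holding "John Doe", "PO Box 2112" and "JOHAN NSW 9700", however many bodytext spans the page happens to contain.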

Answer 2

Sounds like screen scraping. Alternatively, once you've found the information you need, you can use strip_tags().
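A minimal sketch of the strip_tags() route, assuming the fragment of interest is already in a string (the sample fragment is hypothetical, modeled on the question's markup):

```php
<?php
// Hypothetical fragment modeled on the question's markup.
$html = '<td><span class="bodytext">JOHAN</span> NSW 9700</td>';

// strip_tags() is PHP's built-in tag remover; it also drops HTML comments.
$text = trim(strip_tags($html));

echo $text, "\n";
```

Here $text is left as "JOHAN NSW 9700", with no regex involved.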

Answer 3

The seemingly arbitrary Stack Overflow response to this sort of question seems to be "omg don't use regexes! Use Beautiful Soup instead!!". Personally, I prefer not to use external libraries for small tasks like this, and regexes are a good alternative.

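A simple way to strip out all the HTML tags, which is one way to tackle this, is the following regex. The s modifier lets the dot match line breaks, so a tag that spans several lines (like the styled <span> in the question) is removed too:

$text = preg_replace("/<.*?>/s", "", $text);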

then you can use whatever method you like to grab the appropriate text content.
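One detail worth checking with a lazy dot pattern like this is how it treats tags that are broken across lines, as the styled <span> in the question is. The sketch below compares the pattern with and without the s modifier on a hypothetical two-line tag; the variable names are illustrative only.

```php
<?php
// Hypothetical fragment: the opening tag is split across two lines,
// like the styled <span> in the question.
$html = "<span style='font-size:10.0pt;\n  font-family:Verdana'>John Doe</span>";

// Without /s the dot stops at the line break, so the opening tag survives.
$withoutS = preg_replace('/<.*?>/', '', $html);

// With /s the dot also matches newlines and both tags are removed.
$withS = preg_replace('/<.*?>/s', '', $html);

echo $withoutS, "\n";
echo $withS, "\n";
```

With the s modifier the result is just "John Doe"; without it, the multi-line opening tag is left in the output.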

Non-capturing groups (what the question calls non-matching groups) are written like this: (?:this is matched but not captured)
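To illustrate, here is a small sketch in which a (?:...) group holds an alternation of labels without taking up a capture slot. The "Attention" label and the sample string are made up for the example.

```php
<?php
// Hypothetical, already tag-free text.
$text = 'Contact: John Doe Phone: 02 9999 9999';

// (?:...) groups the alternation without creating a capture slot,
// so the only numbered group is the (.*?) that follows it.
preg_match('/(?:Contact|Attention):\s*(.*?)\s*Phone:/s', $text, $m);

print_r($m);
```

$m[0] holds the whole match and $m[1] holds "John Doe"; there is no entry for the label group.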

3 answers

solution1
2 2008-12-18 02:38:55

solution2
0 2009-10-05 13:33:27

solution3
0 2008-12-18 02:39:27
