What is a proper regex to find all the variations of the HTML <td> using Java?

Question

I am trying to practice my skills by loading a formatted HTML table into a Java matrix.

The problem is I am working with regexes and unfortunately they aren't working in the way I want.

For example, for the line:

<TD ALIGN="CENTER" colspan="14"><B class="useNavy">Computer Science</B><br></tr>

I am trying to "clean" the code by turning <TD ALIGN="CENTER" colspan="14"> into a plain <td>.

I use the following code where row contains that line:

row = row.replaceAll("<(td|TD)(.*)?>", "<td>");

I am expecting to get:

<td><B class="useNavy">Computer Science</B><br></tr>

But instead I get a single

<td>

What is wrong with my regex?

I thought I should tell the program to stop at the first match, but replaceFirst doesn't seem to work either.

I tried the following variations of the regex, but the same thing happens:

"<(td|TD).*>", "<(td|TD)(.*)>"

Answer 1

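<(td|TD)[^>]*> should match every opening td tag in your document.

[^>]* is the key part. It means "match as many characters as you can find that aren't the closing greater-than character", so the match stops at the first >. Your .* is greedy and can match > as well, so it runs on to the last > in the line, which is why the whole row gets replaced.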
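To see the difference on the row from the question, here is a minimal sketch that runs both patterns side by side. The class and method names (TdCleaner, cleanTd, cleanTdGreedy) are made up for illustration and are not from the question; only String.replaceAll and the two patterns come from the thread.

```java
class TdCleaner {
    // Pattern from this answer: "<td" or "<TD", then any run of
    // characters that are not '>', then the closing '>'.
    static String cleanTd(String row) {
        return row.replaceAll("<(td|TD)[^>]*>", "<td>");
    }

    // Pattern from the question, kept only for comparison.
    // The greedy ".*" may also consume '>' characters.
    static String cleanTdGreedy(String row) {
        return row.replaceAll("<(td|TD)(.*)?>", "<td>");
    }

    public static void main(String[] args) {
        String row = "<TD ALIGN=\"CENTER\" colspan=\"14\">"
                + "<B class=\"useNavy\">Computer Science</B><br></tr>";

        // Greedy version: the match extends to the last '>' in the row.
        System.out.println(cleanTdGreedy(row));
        // prints: <td>

        // Negated character class: the match ends at the first '>'.
        System.out.println(cleanTd(row));
        // prints: <td><B class="useNavy">Computer Science</B><br></tr>
    }
}
```

Note that this pattern would also match a tag whose name merely starts with "td", and it will not match mixed case such as <Td>. If that matters for your input, something like "(?i)<td\\b[^>]*>" is one option to consider.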

Answer 2

Use this simple regex pattern:

String p="(\\\\.td\\\\.B\\\\sclass.*)";

Hope this helps