Regex to extract heading text from HTML in Java

Question

I have an HTML text file that contains headings, and I would like to extract only the text inside them.

Example:

<h1 class="title"><a href="dtb.htm#rgn_txt_0001_0001">Fire Safety</a></h1>
<h1><a href="dtb.htm#rgn_txt_0002_0001">About this book</a></h1>
<h1><a href="dtb.htm#rgn_par_0002_0008">1</a></h1>
<h1><a href="dtb.htm#rgn_txt_0003_0001">Contents of this book</a></h1>

I would like to extract only the following text from the HTML code:

Fire Safety, About this book, 1, Contents of this book

I tried a lot of things, like:

Pattern pattern = Pattern.compile("<a[^>]href\\s=\\s*\"\\s*([^\"]*)");
Matcher matcher = pattern.matcher(input);

where input is the HTML data.

I either get no results on the console, or sometimes I get only the href value :(

How do I fix this?

Let me know! Thanks!

Answer 1

I strongly recommend using an HTML parser, such as TagSoup, Jericho, NekoHTML, HTML Parser, etc.
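
To illustrate the parser-based approach, here is a minimal sketch. It does not use any of the libraries named above. As an assumption made to keep the example dependency-free, it uses the HTML parser that ships with the JDK (javax.swing.text.html). The class name HeadingExtractor and the method extractHeadings are hypothetical names chosen for illustration. The sketch collects the text of every <a> element found inside an <h1>.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: uses the JDK's built-in HTML parser as a stand-in
// for a third-party parser such as TagSoup or Jericho.
public class HeadingExtractor {

    /** Returns the text of every <a> element nested inside an <h1>. */
    public static List<String> extractHeadings(String html) {
        final List<String> headings = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inH1 = false;
            private boolean inLink = false;
            private final StringBuilder text = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.H1) {
                    inH1 = true;
                } else if (tag == HTML.Tag.A && inH1) {
                    inLink = true;
                    text.setLength(0); // start collecting a new link text
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (inLink) {
                    text.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A && inLink) {
                    headings.add(text.toString().trim());
                    inLink = false;
                } else if (tag == HTML.Tag.H1) {
                    inH1 = false;
                }
            }
        };

        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return headings;
    }

    public static void main(String[] args) {
        String input =
              "<h1 class=\"title\"><a href=\"dtb.htm#rgn_txt_0001_0001\">Fire Safety</a></h1>\n"
            + "<h1><a href=\"dtb.htm#rgn_txt_0002_0001\">About this book</a></h1>\n"
            + "<h1><a href=\"dtb.htm#rgn_par_0002_0008\">1</a></h1>\n"
            + "<h1><a href=\"dtb.htm#rgn_txt_0003_0001\">Contents of this book</a></h1>";
        // Join the extracted heading texts with ", " as in the question.
        System.out.println(String.join(", ", extractHeadings(input)));
    }
}
```

The callback tracks whether the parser is currently inside an <h1> and an <a>, so the href attribute never has to be matched at all. With one of the dedicated libraries the code would look different, but the idea is the same: let the parser find the elements and read their text, rather than pattern-matching the raw markup.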

solution1
3 2012-12-18 07:08:07