Regex for extracting heading text from HTML in Java

I have an HTML text file that contains headings, and I would like to extract only the text inside them.

Example:

<h1 class="title"><a href="dtb.htm#rgn_txt_0001_0001">Fire Safety</a></h1>
<h1><a href="dtb.htm#rgn_txt_0002_0001">About this book</a></h1>
<h1><a href="dtb.htm#rgn_par_0002_0008">1</a></h1>
<h1><a href="dtb.htm#rgn_txt_0003_0001">Contents of this book</a></h1>

I would like to extract only the following text from the HTML code:

Fire Safety, About this book, 1, Contents of this book

I tried a lot of things, like:

Pattern pattern = Pattern.compile("<a[^>]href\\s=\\s*\"\\s*([^\"]*)");
Matcher matcher = pattern.matcher(input);

where input is the HTML data.

I don't get any results on the console, or sometimes I get only the href :(

How can I fix this?

Let me know! Thanks!
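
A note on why the attempted pattern finds nothing on the sample input: `[^>]` without a quantifier matches exactly one character, `\\s=` requires whitespace before the `=` (the sample has none, so the match fails), and the capture group targets the href value rather than the link text. The following is a minimal sketch of a regex that captures the link text instead, assuming every heading has the exact shape `<h1 ...><a ...>text</a></h1>` shown in the example. The class and method names (`HeadingTextExtractor`, `extractHeadings`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingTextExtractor {
    // Assumes each heading looks like <h1 ...><a ...>TEXT</a></h1>.
    // Group 1 captures TEXT; the reluctant .*? stops at the first </a>.
    private static final Pattern HEADING_LINK = Pattern.compile(
            "<h1[^>]*>\\s*<a[^>]*>(.*?)</a>\\s*</h1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the link text of every matching heading, in document order.
    static List<String> extractHeadings(String html) {
        List<String> headings = new ArrayList<>();
        Matcher matcher = HEADING_LINK.matcher(html);
        // find() must be called in a loop; compiling the pattern and
        // creating the matcher does not search anything by itself.
        while (matcher.find()) {
            headings.add(matcher.group(1).trim());
        }
        return headings;
    }
}
```

Calling `String.join(", ", HeadingTextExtractor.extractHeadings(input))` on the four sample lines should give `Fire Safety, About this book, 1, Contents of this book`. The sketch does not decode entities such as `&amp;` and will miss headings whose markup deviates from the assumed shape.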

I strongly recommend using an HTML parser, such as TagSoup, Jericho, NekoHTML, or HTML Parser.
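
To illustrate the parser-based approach without adding a dependency, here is a rough sketch that uses the HTML parser bundled with the JDK (`javax.swing.text.html.parser.ParserDelegator`) as a stand-in. It is not one of the libraries named above, and those libraries have their own, different APIs. The class and method names (`HeadingParser`, `extractHeadings`) are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HeadingParser {
    // Collects the text content of every <h1> element, in document order.
    static List<String> extractHeadings(String html) {
        List<String> headings = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean insideH1 = false;
            private final StringBuilder text = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.H1) {
                    insideH1 = true;
                    text.setLength(0); // start a fresh heading
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (insideH1) {
                    text.append(data); // text inside nested tags such as <a> lands here too
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.H1) {
                    insideH1 = false;
                    headings.add(text.toString().trim());
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return headings;
    }
}
```

Because the parser tracks elements rather than character patterns, this version does not depend on attribute order, spacing, or whether the heading text is wrapped in a link.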
