
Using JSoup to remove only HTML tags and not data within '<' and '>' tags

I'm using JSoup to convert a string that contains HTML tags to plain text. For example:

String newStr = Jsoup.parse(testStrHTML).text();

This works well, but if my Java string contains data between < and >, e.g. Hello <test@gmail.com>, the email address is removed too. The output I get is Hello , whereas I expect Hello <test@gmail.com> .

I have also tried a regular expression, like

String newStr = testStrHTML.replaceAll("\\<.*?\\>", "");

But the problem remains.

Is there any way to strip HTML tags without removing custom data between < and >?

Your regexp

String newStr = testStrHTML.replaceAll("\\<.*?\\>", "");

Completely removes the tag. It matches the < at the start of the tag, the tag name, any attributes, and the final >, then replaces the whole match with an empty string. Because <test@gmail.com> has the same shape, it is removed as well.

String newStr = testStrHTML.replaceAll("<([^>]*)>", "$1");

Should replace every tag with its name and any attributes. It matches roughly the same text as your regexp, but replaces the match with the text inside the brackets (in Java, the replacement string refers to the captured group as $1).
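
To see what the capture-group replacement does in practice, here is a minimal sketch. The class and method names (BracketUnwrapper, unwrap) are made up for illustration and are not part of the original answer.

```java
final class BracketUnwrapper {
    // Replaces every <...> span with the text found between the brackets.
    // Hypothetical helper, named only for this example.
    static String unwrap(String input) {
        // In a Java replacement string, "$1" refers to capture group 1.
        return input.replaceAll("<([^>]*)>", "$1");
    }

    public static void main(String[] args) {
        // The address survives, but so do the tag names "b" and "/b".
        System.out.println(unwrap("Hello <test@gmail.com> <b>world</b>"));
    }
}
```

The email address is kept, but the names of real tags leak into the output too, which is the readability problem mentioned in the answer.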

Note that this removes context, so it might not be a good solution. It also doesn't produce easily readable output, because valid HTML is partially retained.
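
If a regexp is still preferred, one possible variation is to match only tokens that are shaped like HTML tags, so that <test@gmail.com> is left alone. This is a rough sketch rather than a full HTML parser: the names TagStripper and stripTags are hypothetical, and the pattern assumes tag names are alphanumeric and that attribute values contain no < or >. It does not handle comments or script content, and it would still strip a non-tag token such as <foo>.

```java
import java.util.regex.Pattern;

final class TagStripper {
    // Matches "<name ...>", "</name>" and "<name/>" where name starts with a letter.
    // "<test@gmail.com>" does not match, because '@' cannot follow the tag name.
    private static final Pattern TAG_LIKE =
            Pattern.compile("</?[A-Za-z][A-Za-z0-9]*(?:\\s[^<>]*)?/?>");

    // Removes tag-shaped tokens and leaves every other <...> span untouched.
    static String stripTags(String input) {
        return TAG_LIKE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <test@gmail.com>, <b class=\"x\">world</b><br/></p>";
        System.out.println(stripTags(html));
    }
}
```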

It might be better to stay with Jsoup and navigate the DOM.
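
One way to stay with Jsoup, sketched here under assumptions, is to escape the bracketed email addresses as HTML entities before parsing, so the parser treats them as text rather than as a tag. The helper below uses only the standard library. Its name (EmailBracketEscaper) and its deliberately loose email pattern are hypothetical, and the Jsoup call appears only in a comment.

```java
import java.util.regex.Pattern;

final class EmailBracketEscaper {
    // Loose, illustrative pattern for "<local@domain>" tokens: no spaces,
    // no nested brackets, exactly one '@'. Not a full email validator.
    private static final Pattern BRACKETED_EMAIL =
            Pattern.compile("<([^<>\\s@]+@[^<>\\s@]+)>");

    // Rewrites "<a@b>" as "&lt;a@b&gt;" so an HTML parser sees it as text.
    static String escapeBracketedEmails(String html) {
        return BRACKETED_EMAIL.matcher(html).replaceAll("&lt;$1&gt;");
    }

    public static void main(String[] args) {
        String escaped = escapeBracketedEmails("<p>Hello <test@gmail.com></p>");
        System.out.println(escaped);
        // Assumed follow-up step (requires the jsoup library on the classpath):
        // String newStr = Jsoup.parse(escaped).text();
    }
}
```

With the brackets escaped, Jsoup.parse(...).text() should decode the entities back to < and > in the plain-text result while still dropping the real tags.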
