
What is the fastest way to remove HTML tags from a document in Java?

I have a bunch of web documents and want to remove the HTML tags from them. I saw some posts on StackOverflow on how to do this in Java, with approaches ranging from regex to HtmlCleaner and Jsoup.

I am interested in finding the fastest way to do it. I have millions of documents, so performance is crucial in my case. I can even trade a bit of quality for performance.

Thanks for any answers in advance.

My opinion is to use stream/SAX processing as much as possible: 1) it uses less memory, 2) it is fast, and 3) it can be parallelized more easily (a consequence of the low memory consumption).

Those factors matter (from my point of view) for your use case, where you have millions of documents. See the Wikipedia article on SAX.
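To make the SAX idea concrete, here is a minimal sketch using the JDK's built-in SAX parser. The class and method names (SaxTextExtractor, extractText) are made up for illustration, and it assumes the input is well-formed XML/XHTML without a DOCTYPE or HTML-only entities such as &nbsp;, which a plain XML parser would reject or try to resolve.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects only the character data reported by SAX.
final class SaxTextExtractor {

    static String extractText(String xhtml) {
        final StringBuilder text = new StringBuilder();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xhtml)), new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    // Element events are ignored; only text content is kept.
                    text.append(ch, start, length);
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Assumption: malformed input is treated as a caller error.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return text.toString();
    }
}
```

Because the handler never builds a tree, memory use stays roughly proportional to the extracted text, which is what makes running many documents in parallel practical.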

So if your HTML is strict or XHTML, use XSLT. Here is a tutorial on how to transform XML (XHTML) using SAX: XSLT+SAX+Java.
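As a rough sketch of the XSLT route (not the code from the linked tutorial), the JDK's javax.xml.transform API can apply a stylesheet that declares text output and otherwise relies on XSLT's built-in template rules, which copy text nodes and drop the markup. The class name XsltTextExtractor is hypothetical, and the input is again assumed to be well-formed XHTML without a DOCTYPE.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: strips markup by running a text-only XSLT stylesheet.
final class XsltTextExtractor {

    // No explicit templates: the built-in rules emit the text nodes only.
    private static final String TEXT_ONLY_XSLT =
        "<xsl:stylesheet version=\"1.0\" "
      + "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "</xsl:stylesheet>";

    static String extractText(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(TEXT_ONLY_XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Assumption: malformed input is treated as a caller error.
            throw new IllegalArgumentException("Could not transform input", e);
        }
    }
}
```

For millions of documents it would likely pay to compile the stylesheet once with TransformerFactory.newTemplates(...) and create a Transformer per thread from the resulting Templates object, rather than recompiling on every call as this sketch does.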

And finally, if you DON'T have XML-valid HTML, please look at "Java: Replace Strings in Streams, Arrays, Files etc.", which makes use of streams (and PushbackReader).
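For HTML that is not valid XML, the streaming idea can be sketched as a tiny state machine over a Reader. This is a simplified stand-in, not the PushbackReader technique from the linked article; the class name StreamingTagStripper is hypothetical. It assumes that every '<' opens a tag, so it keeps the contents of script and style elements, mishandles comments or attribute values containing '>', and does not decode entities.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper: drops everything between '<' and the next '>'.
final class StreamingTagStripper {

    // Copies characters from in to out, skipping tag contents.
    // Callers should pass buffered streams for large inputs.
    static void strip(Reader in, Writer out) throws IOException {
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                inTag = true;            // start skipping
            } else if (c == '>' && inTag) {
                inTag = false;           // tag closed, resume copying
            } else if (!inTag) {
                out.write(c);            // plain text, including a stray '>'
            }
        }
    }

    // Convenience overload for in-memory strings.
    static String strip(String html) {
        StringWriter out = new StringWriter(html.length());
        try {
            strip(new StringReader(html), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for string I/O
        }
        return out.toString();
    }
}
```

The Reader/Writer overload never holds a whole document in memory, so it can be wired directly between a BufferedReader on the input file and a BufferedWriter on the output file.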

HTH

1) If the HTML is proper XML, you can create its document object and remove the nodes.

2) If it is not proper XML, read the entire HTML as a string and use a replace function to remove the HTML substrings (the tags).

If the HTML is not proper XML, regex is the fastest way to replace within a string.
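The regex variant of option 2 might look like the sketch below. The pattern and the class name RegexTagStripper are illustrative choices, not something prescribed above; the pattern is precompiled because String.replaceAll would recompile it for every document.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes anything that looks like <...>.
final class RegexTagStripper {

    // Compiled once and shared; Pattern instances are thread-safe.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String strip(String html) {
        return TAG.matcher(html).replaceAll("");
    }
}
```

This trades quality for simplicity: the text inside script and style elements survives (for example, "<script>x</script>y" becomes "xy"), and entities such as &amp; are left undecoded.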

It seems the Java regex is the fastest solution. However, it degrades the quality of the resulting text.


 