
What is the fastest way to remove HTML tags from a document in Java?

Original post 2012-01-03 10:25:11 0 3 java/ html-parsing/ jsoup/ html/ htmlcleaner

I have a bunch of web documents and want to remove the HTML tags from them. I saw some posts on StackOverflow on how to do this in Java, with approaches ranging from regex to HtmlCleaner and Jsoup.

I am interested in finding the fastest way to do it. I have millions of documents, so performance is crucial in my case. I can even trade a bit of quality for performance.

Thanks for any answers in advance.

3 answers

My opinion is to use stream/SAX processing as much as possible, because 1) it uses less memory, 2) it is fast, and 3) it can be parallelized more easily (a consequence of the low memory consumption).

From my point of view, those factors matter for your use case, where you have millions of documents. Please see the Wikipedia article on SAX.

So if your HTML is strict or XHTML, use XSLT. Here is a tutorial on how to transform XML (XHTML) using SAX: XSLT+SAX+Java.
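As a rough illustration of the XSLT route, here is a minimal sketch using the JDK's built-in `javax.xml.transform` API. The class name `XsltTextExtractor` and the empty stylesheet are my own assumptions, not taken from the linked tutorial. It relies on XSLT's built-in template rules, which copy text nodes to the output when no other template matches, and it only works on well-formed input.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: extracts the text of a well-formed XHTML string via XSLT.
class XsltTextExtractor {
    // No templates of our own: the built-in XSLT rules emit every text node.
    private static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "</xsl:stylesheet>";

    // Compile the stylesheet once; Templates objects are safe to share across threads.
    private static final Templates TEMPLATES = compile();

    private static Templates compile() {
        try {
            return TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(new StringReader(STYLESHEET)));
        } catch (TransformerException e) {
            throw new IllegalStateException("Stylesheet did not compile", e);
        }
    }

    static String extractText(String xhtml) {
        StringWriter out = new StringWriter();
        try {
            // A Transformer is cheap to create from Templates but is not thread-safe.
            TEMPLATES.newTransformer().transform(
                    new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return out.toString();
    }
}
```

Note that this keeps the text inside elements such as `script` and `style` as well; filtering those out would need an extra template in the stylesheet.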

And finally, if your HTML is NOT valid XML, please look at "Java: Replace Strings in Streams, Arrays, Files etc.", which makes use of streams (and PushbackReader).
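To make the streaming idea concrete, here is a minimal sketch of a character-level filter that drops everything between `<` and `>` while copying the rest. It is not the code from the linked article; the class name `StreamingTagStripper` is hypothetical, and the sketch assumes that any `<` starts a tag, so a literal `<` in the text (or a `>` inside an attribute value) will confuse it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper: strips tags in a single pass with constant memory.
class StreamingTagStripper {
    // Copies characters from in to out, skipping everything from '<' up to the next '>'.
    static void strip(Reader in, Writer out) throws IOException {
        boolean insideTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                insideTag = true;          // assumption: every '<' opens a tag
            } else if (c == '>' && insideTag) {
                insideTag = false;         // tag closed, resume copying
            } else if (!insideTag) {
                out.write(c);              // ordinary text
            }
        }
    }

    // Convenience overload for in-memory strings.
    static String strip(String html) {
        StringWriter out = new StringWriter();
        try {
            strip(new StringReader(html), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

When reading from files, wrapping the input in a `BufferedReader` and the output in a `BufferedWriter` avoids a system call per character. Since the filter holds no state between documents, separate documents can be processed on separate threads.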

HTH

1) If the HTML is proper XML, you can create its document object and remove the tag nodes.

2) If it is not proper XML, read the entire HTML as a string and use a replace function to remove the HTML tag substrings.

If the HTML is not proper XML, then regex is the fastest way to replace in a string.
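A minimal sketch of the regex approach from point 2, assuming a simple pattern that treats anything between `<` and `>` as a tag. The class name `RegexTagStripper` and the pattern itself are illustrative choices, not something this answer specifies. The pattern is compiled once because `String.replaceAll` recompiles it on every call, which adds up over millions of documents.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes tags from an in-memory string with one precompiled regex.
class RegexTagStripper {
    // "<", then any run of characters that are not ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String strip(String html) {
        return TAG.matcher(html).replaceAll("");
    }
}
```

This only deletes the tags themselves. For example, `RegexTagStripper.strip("<script>var x = 1;</script>text")` leaves the script body in the result, and HTML entities such as `&amp;` are not decoded, which is one way the output quality suffers compared with a real parser.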

It seems that Java regex is the fastest solution. However, it degrades the quality of the resulting text.

What is the fastest way to get a HTML Content using java?

how to remove attribute from html tags in java

Java Regular Expression to Remove tags from html

Fastest way to remove a Collection of Longs from another in Java

What is the absolute fastest way to read and write strings from a file with Java?

What is the fastest way to import data from application (Java) into temporary table?

What is the fastest way to get data into Cassandra 2 from a Java application?

What would be the fastest way to read integers from a file in Java?

What's the fastest way to read from System.in in Java?

Java, IO - fastest way to remove file


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1
1 2012-01-03 13:24:52

solution2
0 2012-01-03 12:03:06

solution3
0 ACCEPTED 2012-01-10 13:01:08
