I have HTML like this:
<p> This is paragraph 1 </p>
This is paragraph 2
<p> This is paragraph 3 </p>
I will process the above HTML in Java, and I want the processed HTML to be:
<p> This is paragraph 1 </p>
<p> This is paragraph 2 </p>
<p> This is paragraph 3 </p>
There may also be long paragraphs that span more than one line, so line-by-line processing doesn't work in this case. For example:
<p> ...
...
...
</p>
There can also be cases like this:
<p> This </p> can be <p> the case too. </p>
I need the above line to be converted to:
<p> This </p><p> can be </p><p> the case too. </p>
I want to achieve this because Jsoup doesn't identify text that is not inside <p> tags. If Jsoup can do that by any means, I am happy with that too. I don't want any text to be missed from the document.
Jsoup can give you the parts which are not in <p>. Since they are not enclosed in a tag, they are text nodes rather than elements. So you should traverse the nodes rather than the elements. Here is an example:
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class SimpleTest {
    public static final String HTML = "<p> This is paragraph 1 </p>\n"
            + "This is paragraph 2\n"
            + "<p> This is paragraph 3 </p>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // childNodes() returns text nodes as well as elements
        List<Node> nodes = doc.body().childNodes();
        for (Node node : nodes) {
            System.out.printf("Node of %s, %s%n", node.getClass(), node);
        }
    }
}
The output is:
Node of class org.jsoup.nodes.Element, <p> This is paragraph 1 </p>
Node of class org.jsoup.nodes.TextNode, This is paragraph 2
Node of class org.jsoup.nodes.Element, <p> This is paragraph 3 </p>
So when you want to do something practical with an unknown node, you should test it with instanceof to see if it's a TextNode, an Element or something else. Then you cast it to the relevant class, and you can use all its methods in addition to the ones that are available in Node.
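To go from detecting loose text to the markup the question asks for, one option is to wrap every non-blank TextNode in a <p>. The following is only a sketch: it assumes jsoup is on the classpath and relies on Node.wrap and TextNode.isBlank, and the class and method names (WrapLooseText, wrapLooseText) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class WrapLooseText {
    // Hypothetical helper: wraps each non-blank text node directly under <body> in a <p>.
    static Document wrapLooseText(String html) {
        Document doc = Jsoup.parse(html);
        // Iterate over a copy, because wrapping changes the body's child list.
        List<Node> children = new ArrayList<>(doc.body().childNodes());
        for (Node node : children) {
            if (node instanceof TextNode) {
                TextNode text = (TextNode) node;
                if (text.isBlank()) {
                    continue; // skip whitespace-only nodes such as line breaks
                }
                text.text(text.text().trim());
                text.wrap("<p></p>");
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = wrapLooseText("<p> This </p> can be <p> the case too. </p>");
        System.out.println(doc.body().html());
    }
}
```

Only direct children of the body are handled here; text nested inside other elements is left untouched.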
Have you tried writing a parser to check if there's a <p> tag at the beginning of each line?
The code could look something like this:
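String[] lines = html_code.split("\n");
StringBuilder solution = new StringBuilder();
for (String s : lines) {
    s = s.trim();
    if (s.startsWith("<p>")) {
        solution.append(s);
    } else {
        // wrap a bare line in <p> tags; append instead of overwriting the result
        solution.append("<p>").append(s).append("</p>");
    }
}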
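As the question points out, a per-line check cannot cope with paragraphs that span several lines or with loose text between two <p> blocks on the same line. A rough regex-based sketch that handles both cases could look like the code below. It assumes the input contains only plain <p>...</p> tags (no attributes, no nesting, no other tags); the names ParagraphWrapper and wrapLooseText are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphWrapper {
    // Either a complete <p>...</p> block (which may span lines) or a run of text without '<'.
    private static final Pattern CHUNK =
            Pattern.compile("<p>.*?</p>|[^<]+", Pattern.DOTALL);

    static String wrapLooseText(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = CHUNK.matcher(html);
        while (m.find()) {
            String chunk = m.group().trim();
            if (chunk.isEmpty()) {
                continue; // whitespace between blocks
            }
            if (chunk.startsWith("<p>")) {
                out.append(chunk); // already a paragraph, keep as-is
            } else {
                out.append("<p> ").append(chunk).append(" </p>"); // wrap loose text
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrapLooseText("<p> This </p> can be <p> the case too. </p>"));
        // prints: <p> This </p><p> can be </p><p> the case too. </p>
    }
}
```

For real-world HTML a proper parser such as Jsoup is the safer choice; this only illustrates why matching whole blocks works better than splitting on line breaks.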
Cool, I figured it out with RealSkeptic's help. This is just an improved version of the code, adapted to the question:
for (Node node : nodes) {
    if (node.getClass() == Element.class) {
        Element element = (Element) node;
        System.out.println(element.tag() + " " + element.text());
    } else if (node.getClass() == TextNode.class && !node.toString().trim().isEmpty()) {
        System.out.println(node.toString().trim());
    }
}
Here I check that the trimmed text is not empty because Jsoup also turns the whitespace between tags, such as line breaks, into TextNodes.