Add <p> tags programmatically in Java

Question

I have some HTML like this:

<p> This is paragraph 1 </p>
This is paragraph 2
<p> This is paragraph 3 </p>

I will process the above HTML in Java, and I want the processed HTML to be:

<p> This is paragraph 1 </p>
<p> This is paragraph 2 </p>
<p> This is paragraph 3 </p>

There might also be long paragraphs that span more than one line, so line-by-line processing doesn't work in this case. For example:

<p> ...
...
...
</p>

There can also be cases like:

<p> This </p> can be <p> the case too. </p>

I need the above line to be converted to:

<p> This </p><p> can be </p><p> the case too. </p>

I want to achieve this because Jsoup doesn't identify text without <p> tags. If Jsoup can do that by any means, I am happy with that too. I don't want any text from the document to be missed.

Answer 1

Jsoup can give you the parts that are not inside a <p>. Since they are not enclosed in a tag, they are text nodes rather than elements, so you should traverse the nodes rather than the elements. Here is an example:

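import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class SimpleTest {

    public static final String HTML = "<p> This is paragraph 1 </p>\n"
                                    + "This is paragraph 2\n"
                                    + "<p> This is paragraph 3 </p>";

    public static void main(String[] args) {

        Document doc = Jsoup.parse(HTML);

        // childNodes() includes text nodes, unlike children(), which returns only elements
        List<Node> nodes = doc.body().childNodes();

        for ( Node node : nodes ) {
            System.out.printf("Node of %s, %s%n", node.getClass(), node);
        }
    }
}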

The output is:

Node of class org.jsoup.nodes.Element, <p> This is paragraph 1 </p>
Node of class org.jsoup.nodes.TextNode,  This is paragraph 2 
Node of class org.jsoup.nodes.Element, <p> This is paragraph 3 </p>

So when you want to do something practical with an unknown node, you should test it with instanceof to see if it's a TextNode, an Element or something else. Then you cast it to the relevant class, and you can use all its methods in addition to the ones that are available in Node.
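To get from there to the output the question asks for, one option is to wrap each loose text node in a <p> of its own. The following is only a sketch, assuming a Jsoup version that provides TextNode.isBlank() and Node.wrap(String); the class name WrapLooseText and the method wrapLooseText are made up for illustration, and the exact whitespace of the serialized result depends on Jsoup's output settings.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;

// Hypothetical helper class, not part of Jsoup.
public class WrapLooseText {

    // Wraps every non-blank text node that sits directly under <body> in its own <p>.
    public static String wrapLooseText(String html) {
        Document doc = Jsoup.parse(html);

        // textNodes() returns a separate list of the direct text children,
        // so the tree can be modified while iterating over it.
        for (TextNode text : doc.body().textNodes()) {
            if (text.isBlank()) {
                continue; // only whitespace, e.g. the line break between two tags
            }
            text.wrap("<p></p>"); // replaces the text node with <p>text</p>
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(wrapLooseText("<p> This </p> can be <p> the case too. </p>"));
    }
}
```

Because the parser has already built the tree, this works the same way for paragraphs that span several lines and for lines that mix tagged and untagged text.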

Answer 2

Have you tried writing a parser that checks whether there is a <p> tag at the beginning of each line?

The code could look something like this:

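// html_code is assumed to hold the input HTML as a String
String[] splitted = html_code.split("\n");
String solution = "";
for (String s : splitted) {
    s = s.trim();
    if (s.startsWith("<p>")) {
        // line already starts with a paragraph tag: keep it as it is
        solution += s;
    } else {
        // untagged line: wrap it in a paragraph
        solution += "<p>" + s + "</p>";
    }
}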
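As the question points out, a purely line-based check breaks on paragraphs that span several lines and on lines that mix tagged and untagged text. A variation that scans for whole <p>...</p> blocks instead of lines might look like the sketch below. It uses only the standard library; the names ParagraphWrapper and wrapLooseText are invented for this example, and it assumes flat input like the samples in the question (no attributes on <p>, no nested or other tags). A regular expression is not a general HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
public class ParagraphWrapper {

    // Matches one complete <p>...</p> block; DOTALL lets it span several lines.
    private static final Pattern P_BLOCK =
            Pattern.compile("<p>.*?</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    public static String wrapLooseText(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = P_BLOCK.matcher(html);
        int last = 0;
        while (m.find()) {
            // Text between the previous block and this one is untagged.
            appendWrapped(out, html.substring(last, m.start()));
            out.append(m.group()); // existing paragraph is copied unchanged
            last = m.end();
        }
        appendWrapped(out, html.substring(last)); // trailing untagged text
        return out.toString();
    }

    // Wraps non-empty loose text in <p> tags; whitespace-only gaps are dropped.
    private static void appendWrapped(StringBuilder out, String loose) {
        String text = loose.trim();
        if (!text.isEmpty()) {
            out.append("<p> ").append(text).append(" </p>");
        }
    }

    public static void main(String[] args) {
        // prints: <p> This </p><p> can be </p><p> the case too. </p>
        System.out.println(wrapLooseText("<p> This </p> can be <p> the case too. </p>"));
    }
}
```

Note that line breaks between blocks are dropped in the result, since whitespace-only gaps are skipped.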

Answer 3

Cool, I figured it out with RealSkeptic's help. This is just an improved version of the code, adapted to the question:

// nodes is the List<Node> returned by doc.body().childNodes()
for (Node node : nodes) {
    if (node instanceof Element) {
        Element element = (Element) node;
        System.out.println(element.tag() + " " + element.text());
    } else if (node instanceof TextNode && !node.toString().trim().isEmpty()) {
        // skip text nodes that contain only whitespace
        System.out.println(node.toString().trim());
    }
}

Here I check that the trimmed text is not empty because Jsoup also turns the whitespace between elements, such as a line break, into a TextNode.


solution1
2 ACCEPTED 2015-10-31 08:41:03

solution2
0 2015-10-31 08:16:17

solution3
0 2015-10-31 09:40:16
