Jsoup: How to get all html between 2 header tags

Question

I am trying to get all html between 2 h1 tags. Actual task is to break the html into frames(chapters) based of the h1(heading 1) tags.

Appreciate any help.

Thanks Sunil

Answer 1

If you want to get and process all elements between two consecutive h1 tags you can work on siblings. Here's some example code:

public static void h1s() {
  String html = "<html>" +
  "<head></head>" +
  "<body>" +
  "  <h1>title 1</h1>" +
  "  <p>hello 1</p>" +
  "  <table>" +
  "    <tr>" +
  "      <td>hello</td>" +
  "      <td>world</td>" +
  "      <td>1</td>" +
  "    </tr>" +
  "  </table>" +
  "  <h1>title 2</h1>" +
  "  <p>hello 2</p>" +
  "  <table>" +
  "    <tr>" +
  "      <td>hello</td>" +
  "      <td>world</td>" +
  "      <td>2</td>" +
  "    </tr>" +
  "  </table>" +
  "  <h1>title 3</h1>" +
  "  <p>hello 3</p>" +
  "  <table>" +
  "    <tr>" +
  "      <td>hello</td>" +
  "      <td>world</td>" +
  "      <td>3</td>" +
  "    </tr>" +
  "  </table>" +    
  "</body>" +
  "</html>";
  Document doc = Jsoup.parse(html);
  Element firstH1 = doc.select("h1").first();
  Elements siblings = firstH1.siblingElements();
  List<Element> elementsBetween = new ArrayList<Element>();
  for (int i = 1; i < siblings.size(); i++) {
    Element sibling = siblings.get(i);
    if (! "h1".equals(sibling.tagName()))
      elementsBetween.add(sibling);
    else {
      processElementsBetween(elementsBetween);
      elementsBetween.clear();
    }
  }
  if (! elementsBetween.isEmpty())
    processElementsBetween(elementsBetween);
}

private static void processElementsBetween(
    List<Element> elementsBetween) {
  System.out.println("---");
  for (Element element : elementsBetween) {
    System.out.println(element);
  }
}

Answer 2

I don't know Jsoup that good, but a straight forward approach could look like this:

public class Test {

    public static void main(String[] args){

        Document document = Jsoup.parse("<html><body>" +
            "<h1>First</h1><p>text text text</p>" +
            "<h1>Second</h1>more text" +
            "</body></html>");

        List<List<Node>> articles = new ArrayList<List<Node>>();
        List<Node> currentArticle = null;

        for(Node node : document.getElementsByTag("body").get(0).childNodes()){
            if(node.outerHtml().startsWith("<h1>")){
                currentArticle = new ArrayList<Node>();
                articles.add(currentArticle);
            }

            currentArticle.add(node);
        }

        for(List<Node> article : articles){
            for(Node node : article){
                System.out.println(node);
            }
            System.out.println("------- new page ---------");
        }

    }

}

Do you know the structure of the articles and is it always the same? What do you want to do with the articles? Have you considered splitting them on the client side? This would be an easy jQuery Job.

Answer 3

Iterating over the elements between consecutive <h> elements seems to be fine, except one thing. Text not belonging to any tag, like in <h1/>this<h1/> . To workaround this I implemented splitElemText function to get this text. First split whole parent element using this method. Then except the element, process the suitable entry from the splitted text. Remove calls to htmlToText if you want raw html.

/** Splits the text of the element <code>elem</code> by the children
  * tags.
  * @return An array of size <code>c+1</code>, where <copde>c</code>
  * is the number of child elements.
  * <p>Text after <code>n</code>th element is found in <code>[n+1]</code>.
  */
public static String[] splitElemText(Element elem)
{
  int c = elem.children().size();
  String as[] = new String[c + 1];
  String sAll = elem.html();
  int iBeg = 0;
  int iChild = 0;
  for (Element ch : elem.children()) {
    String sChild = ch.outerHtml();
    int iEnd = sAll.indexOf(sChild, iBeg);
    if (iEnd < 0) { throw new RuntimeException("Tag " + sChild
                    +" not found in its parent: " + sAll);
    }
    as[iChild] = htmlToText(sAll.substring(iBeg, iEnd));
    iBeg = iEnd + sChild.length();
    iChild += 1;
  }
  as[iChild] = htmlToText(sAll.substring(iBeg));
  assert(iChild == c);
  return as;
}

public static String htmlToText(String sHtml)
{
  Document doc = Jsoup.parse(sHtml);
  return doc.text();
}

Jsoup: How to get all html between 2 header tags

Question

3 answers

solution1
5 ACCPTED 2011-06-30 14:13:55

solution2
4 2011-06-30 13:04:34

solution3
2 2012-04-25 06:03:02

Jsoup: How to get all html between 2 header tags

Question

3 answers

solution1 5 ACCPTED 2011-06-30 14:13:55

solution2 4 2011-06-30 13:04:34

solution3 2 2012-04-25 06:03:02

solution1
5 ACCPTED 2011-06-30 14:13:55

solution2
4 2011-06-30 13:04:34

solution3
2 2012-04-25 06:03:02