简体   繁体   中英

Jsoup: How to get all html between 2 header tags

I am trying to get all html between 2 h1 tags. Actual task is to break the html into frames(chapters) based of the h1(heading 1) tags.

Appreciate any help.

Thanks Sunil

If you want to get and process all elements between two consecutive h1 tags you can work on siblings. Here's some example code:

public static void h1s() {
  String html = "<html>" +
  "<head></head>" +
  "<body>" +
  "  <h1>title 1</h1>" +
  "  <p>hello 1</p>" +
  "  <table>" +
  "    <tr>" +
  "      <td>hello</td>" +
  "      <td>world</td>" +
  "      <td>1</td>" +
  "    </tr>" +
  "  </table>" +
  "  <h1>title 2</h1>" +
  "  <p>hello 2</p>" +
  "  <table>" +
  "    <tr>" +
  "      <td>hello</td>" +
  "      <td>world</td>" +
  "      <td>2</td>" +
  "    </tr>" +
  "  </table>" +
  "  <h1>title 3</h1>" +
  "  <p>hello 3</p>" +
  "  <table>" +
  "    <tr>" +
  "      <td>hello</td>" +
  "      <td>world</td>" +
  "      <td>3</td>" +
  "    </tr>" +
  "  </table>" +    
  "</body>" +
  "</html>";
  Document doc = Jsoup.parse(html);
  Element firstH1 = doc.select("h1").first();
  Elements siblings = firstH1.siblingElements();
  List<Element> elementsBetween = new ArrayList<Element>();
  for (int i = 1; i < siblings.size(); i++) {
    Element sibling = siblings.get(i);
    if (! "h1".equals(sibling.tagName()))
      elementsBetween.add(sibling);
    else {
      processElementsBetween(elementsBetween);
      elementsBetween.clear();
    }
  }
  if (! elementsBetween.isEmpty())
    processElementsBetween(elementsBetween);
}

private static void processElementsBetween(
    List<Element> elementsBetween) {
  System.out.println("---");
  for (Element element : elementsBetween) {
    System.out.println(element);
  }
}

I don't know Jsoup that good, but a straight forward approach could look like this:

public class Test {

    public static void main(String[] args){

        Document document = Jsoup.parse("<html><body>" +
            "<h1>First</h1><p>text text text</p>" +
            "<h1>Second</h1>more text" +
            "</body></html>");

        List<List<Node>> articles = new ArrayList<List<Node>>();
        List<Node> currentArticle = null;

        for(Node node : document.getElementsByTag("body").get(0).childNodes()){
            if(node.outerHtml().startsWith("<h1>")){
                currentArticle = new ArrayList<Node>();
                articles.add(currentArticle);
            }

            currentArticle.add(node);
        }

        for(List<Node> article : articles){
            for(Node node : article){
                System.out.println(node);
            }
            System.out.println("------- new page ---------");
        }

    }

}

Do you know the structure of the articles and is it always the same? What do you want to do with the articles? Have you considered splitting them on the client side? This would be an easy jQuery Job.

Iterating over the elements between consecutive <h> elements seems to be fine, except one thing. Text not belonging to any tag, like in <h1/>this<h1/> . To workaround this I implemented splitElemText function to get this text. First split whole parent element using this method. Then except the element, process the suitable entry from the splitted text. Remove calls to htmlToText if you want raw html.

/** Splits the text of the element <code>elem</code> by the children
  * tags.
  * @return An array of size <code>c+1</code>, where <copde>c</code>
  * is the number of child elements.
  * <p>Text after <code>n</code>th element is found in <code>[n+1]</code>.
  */
public static String[] splitElemText(Element elem)
{
  int c = elem.children().size();
  String as[] = new String[c + 1];
  String sAll = elem.html();
  int iBeg = 0;
  int iChild = 0;
  for (Element ch : elem.children()) {
    String sChild = ch.outerHtml();
    int iEnd = sAll.indexOf(sChild, iBeg);
    if (iEnd < 0) { throw new RuntimeException("Tag " + sChild
                    +" not found in its parent: " + sAll);
    }
    as[iChild] = htmlToText(sAll.substring(iBeg, iEnd));
    iBeg = iEnd + sChild.length();
    iChild += 1;
  }
  as[iChild] = htmlToText(sAll.substring(iBeg));
  assert(iChild == c);
  return as;
}

public static String htmlToText(String sHtml)
{
  Document doc = Jsoup.parse(sHtml);
  return doc.text();
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM