Parsing HTML using Java (Jsoup)

Question

I am facing a problem parsing an HTML document with jsoup (Java). The HTML I'm parsing has this format:

.....
<hr>
  <a name="N1"> </a> Text 1<br>
<hr>
  <a name="N2"> </a> Text 2<br>
<hr>
  <a name="N3"> </a>Text 3<br>
<hr>
  <a name="N4"> </a>
  <DIV style="margin-left: 36px">
   <div></div>
   <img src=bullet.gif alt="Bullet point"> Text
  </DIV><br>
<hr>
 <a name="X5"> </a>
 <DIV style="margin-left: 36px">
  <div></div>
  <img src=bullet.gif alt="Bullet point"> Text
 </DIV><br>
<hr>
  ...

I want to isolate the HTML between each pair of consecutive "hr" tags. I am trying this code:

File input = new File("C:\\Users\\page.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements body = doc.select("body");
Elements hrs = body.select("hr");
ArrayList<String> objects = new ArrayList<String>(); 
for (Element hr : hrs) { 
  String textAfterHr = hr.nextSibling().toString();
  objects.add(textAfterHr);   
}

System.out.println(objects);

but the ArrayList doesn't contain what I want, and I don't know how to resolve it. (Could I transform each "hr" tag into "<hr>text</hr>" tags?)

Answer 1

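You can get the result by reading the children of each hr tag. Try this instead. Note that jsoup's HTML parser treats hr as a void element with no children, so the loop below only finds something if the document was parsed in a way that lets each hr wrap the content after it (for example with the XML parser, after closing tags have been added).

ArrayList<String> objects = new ArrayList<String>();
Elements hrs = body.select("hr");
for (int i = 0; i < hrs.size(); i++) {
    Element hrElm = hrs.get(i);
    // Element children only; bare text directly inside the hr is not included
    Elements children = hrElm.children();
    for (Element child : children) {
        String text = child.text();
        objects.add(text);
    }
}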
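For what it's worth, the code in the question most likely fails because hr.nextSibling() returns only the single node right after each hr, which in this markup is a whitespace text node. Below is a minimal sketch of an alternative, assuming jsoup is on the classpath and the default HTML parser is used: keep walking the siblings until the next hr shows up. The class name HrSections and the method sectionsAfterEachHr are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class HrSections {

    // Hypothetical helper: for every <hr>, collects the outer HTML of the
    // sibling nodes that follow it, stopping at the next <hr> or at the
    // end of the parent element.
    static List<String> sectionsAfterEachHr(Document doc) {
        List<String> sections = new ArrayList<>();
        for (Element hr : doc.select("hr")) {
            StringBuilder section = new StringBuilder();
            Node node = hr.nextSibling();
            while (node != null && !"hr".equals(node.nodeName())) {
                section.append(node.outerHtml());
                node = node.nextSibling();
            }
            sections.add(section.toString().trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        // Shortened sample in the same shape as the HTML from the question
        String html = "<hr>\n  <a name=\"N1\"> </a> Text 1<br>\n"
                    + "<hr>\n  <a name=\"N2\"> </a> Text 2<br>\n<hr>";
        Document doc = Jsoup.parse(html);
        for (String section : sectionsAfterEachHr(doc)) {
            System.out.println(section);
            System.out.println("-----------------------");
        }
    }
}
```

The last hr has nothing after it, so its entry is an empty string; filter those out if you only want the sections between two hr tags.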

Answer 2

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) {
        String html = ".....\n" +
                        "<hr>\n" +
                        "  <a name=\"N1\"> </a> Text 1<br>\n" +
                        "<hr>\n" +
                        "  <a name=\"N2\"> </a> Text 2<br>\n" +
                        "<hr>\n" +
                        "  <a name=\"N3\"> </a>Text 3<br>\n" +
                        "<hr>\n" +
                        "  <a name=\"N4\"> </a>\n" +
                        "  <DIV style=\"margin-left: 36px\">\n" +
                        "   <div></div>\n" +
                        "   <img src=bullet.gif alt=\"Bullet point\"> Text\n" +
                        "  </DIV><br>\n" +
                        "<hr>\n" +
                        " <a name=\"X5\"> </a>\n" +
                        " <DIV style=\"margin-left: 36px\">\n" +
                        "  <div></div>\n" +
                        "  <img src=bullet.gif alt=\"Bullet point\"> Text\n" +
                        " </DIV><br>\n" +
                        "<hr>\n" +
                        "  ...";
        // Split the HTML string in front of each <hr> tag; the lookahead keeps the delimiter
        String[] parts = html.split("(?=<hr>)");
        // Join the pieces back together, inserting a closing </hr> tag between them
        html = String.join("</hr>\n", parts);
        // Parse with jsoup's XML parser, which does not treat <hr> as a void element,
        // so each <hr> keeps the content after it as its children
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        Elements eles = doc.select("hr");
        for (Element e : eles) {
            System.out.println(e.html());
            System.out.println("-----------------------");
        }
    }
}
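If you want to see what the lookahead split does on its own, it can be tried without jsoup at all. The following is only a sketch using plain String methods; the class name HrSplitDemo and the method segmentsAfterHr are made up for illustration, and it assumes every separator is written exactly as a lowercase <hr> (it would miss <HR>, <hr/> or an hr with attributes).

```java
import java.util.ArrayList;
import java.util.List;

public class HrSplitDemo {

    // Hypothetical helper: splits the raw markup in front of every literal
    // "<hr>" and returns the text that follows each one, with the "<hr>"
    // itself and the surrounding whitespace removed.
    static List<String> segmentsAfterHr(String html) {
        List<String> segments = new ArrayList<>();
        // "(?=<hr>)" is a zero-width lookahead, so the delimiter stays
        // at the start of each piece instead of being consumed
        for (String part : html.split("(?=<hr>)")) {
            if (part.startsWith("<hr>")) {
                segments.add(part.substring("<hr>".length()).trim());
            }
            // Anything before the first <hr> is skipped
        }
        return segments;
    }

    public static void main(String[] args) {
        String html = ".....\n"
                    + "<hr>\n  <a name=\"N1\"> </a> Text 1<br>\n"
                    + "<hr>\n  <a name=\"N2\"> </a> Text 2<br>\n"
                    + "<hr>\n  ...";
        for (String segment : segmentsAfterHr(html)) {
            System.out.println(segment);
            System.out.println("-----------------------");
        }
    }
}
```

String matching like this is fragile on real-world HTML, so it is probably best kept as a preprocessing step before handing the result to a parser, as the answer does.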

Answer 1
0 2017-07-20 06:02:41

Answer 2
0 2017-07-20 10:43:46
