
JSoup: Parse text and links in sequence from an HTML file

I am trying to extract the text and links from an HTML file. At the moment I can extract both easily using JSoup, but only separately.

Here is my code:

try {
    // Jsoup.parse(File, String) already returns a Document, so no cast is needed
    doc = Jsoup.parse(new File(input), "UTF-8");
    Elements paragraphs = doc.select("td.text");

    for (Element p : paragraphs) {
        getGui().setTextVers(p.text() + "\r\n" + "***********************************************************" + "\r\n");
    }

    // Note: this lookup covers the whole document, not a single td
    Elements links = doc.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        getGui().setTextVers("\n\n" + linkText + ">\r\n" + linkHref + "\r\n");
    }
} catch (IOException e) {
    // Placeholder handling; Jsoup.parse(File, String) throws IOException
    e.printStackTrace();
}

I have placed a .text class on the outermost td that contains text. What I would like to achieve is this: when the program finds a td with the .text class, it checks that td for any links and extracts them from that section in order. So you would have:

Text

Link

Text

Link

I tried putting an inner for-each loop inside the first for-each loop, but this only printed the full list of links for the page. Can anyone help?

Try looking up the links on each td rather than on the whole document:

// Jsoup.parse(File, String) returns a Document and throws IOException
Document doc = Jsoup.parse(new File(input), "UTF-8");
Elements paragraphs = doc.select("td.text");

for (Element p : paragraphs) {
    // All text of this td (this includes the text of any links inside it)
    System.out.println(p.text());

    // Only the links inside this td, in document order
    Elements links = p.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println("\n\n" + linkText + ">\r\n" + linkHref + "\r\n");
    }
}
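
Note that the loop above prints all of a td's text first and then its links. If the text and links need to be interleaved in the order they appear inside each td, one possible approach is to walk the td's nodes depth-first with a NodeVisitor. The following is only a sketch: the class name OrderedExtractor, the helper names extractInOrder and insideLink, the "TEXT:"/"LINK:" prefixes, and the sample HTML are all made up for illustration, and it assumes the td.text cells are not nested inside one another.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

class OrderedExtractor {

    // True if the node has an <a> element among its ancestors.
    static boolean insideLink(Node node) {
        for (Node parent = node.parent(); parent != null; parent = parent.parent()) {
            if (parent.nodeName().equals("a")) {
                return true;
            }
        }
        return false;
    }

    // Walks root depth-first and records text runs and links in document order.
    // (Hypothetical helper, not part of jsoup.)
    static List<String> extractInOrder(Element root) {
        final List<String> out = new ArrayList<>();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof Element && node.nodeName().equals("a")) {
                    Element link = (Element) node;
                    out.add("LINK: " + link.text() + " > " + link.attr("href"));
                } else if (node instanceof TextNode && !insideLink(node)) {
                    // Link text is already reported with its link, so skip it here
                    String text = ((TextNode) node).text().trim();
                    if (!text.isEmpty()) {
                        out.add("TEXT: " + text);
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node
            }
        });
        return out;
    }

    public static void main(String[] args) {
        // Illustrative input; the original code parses a file instead
        String html = "<table><tr><td class=\"text\">First text "
                + "<a href=\"http://example.com/a\">Link A</a> second text "
                + "<a href=\"http://example.com/b\">Link B</a></td></tr></table>";
        Document doc = Jsoup.parse(html);
        for (Element cell : doc.select("td.text")) {
            for (String line : extractInOrder(cell)) {
                System.out.println(line);
            }
        }
    }
}
```

In the original program, each returned line could be passed to getGui().setTextVers(...) instead of System.out.println.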
