
How to parse an HTML file using Jsoup

I have an HTML table and want to extract link text based on a certain condition.

<table border="0" cellpadding="3" cellspacing="0" width="100%">
<tbody>
<tr class="dir"><td colspan="2">&nbsp;&nbsp;<a href="http://xyz/">Yogendra sharma</a></td></tr>
<tr>
<td class="f"><a href="abc">abc</a>&nbsp;</td>
<td>
<tt class="con">
<a class="s" href="mno"><span class="l">7</span> mno <b>Hello</b>;</a>
<br>
</tt>
</td></tr>

<tr class="dir"><td colspan="2">&nbsp;&nbsp;<a href="http://xyz/">Yogendra</a></td></tr>
<tr>
<td class="f"><a href="abc">abc</a>&nbsp;</td>
<td>
<tt class="con">
<a class="s" href="mno"><span class="l">7</span> mno <b>Hello</b>;</a>
<br>
</tt>
</td></tr>
</tbody>
</table>

I want to print the text of each first link, i.e. "Yogendra sharma" and "Yogendra", from the HTML file.

The file is huge.

I am using Java with jsoup but can't figure it out. Please help me.

You can try the code below. You will need commons-io-1.3.2.jar and jsoup.jar on the classpath. Save the HTML as sample.html in the root folder of the project.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractFromHTML {

    public static void main(String[] args) throws IOException {

        File input = new File("sample.html");

        // Read the whole file into a string; try-with-resources closes the stream.
        // UTF-8 is an assumption about the file's encoding.
        String htmlOut;
        try (InputStream in = new FileInputStream(input)) {
            htmlOut = IOUtils.toString(in, "UTF-8");
        }

        Document document = Jsoup.parse(htmlOut);

        // Select every <a> element in the document.
        Elements elementsA = document.select("a");

        for (Element aElement : elementsA) {
            // Keep only the links whose href points at http://xyz/.
            if (aElement.attr("href").contains("http://xyz/")) {
                System.out.println(aElement.text());
            }
        }
    }
}

Output:

Yogendra sharma
Yogendra
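
If the links you want are always the first link inside a row with class "dir", as in the sample table, matching on the row may be more robust than matching on the URL. The following is only a sketch under that assumption. The class name FirstLinkPerRow and the method firstLinkTexts are made up for illustration, and the HTML is inlined as a string so the example runs without a file.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FirstLinkPerRow {

    // Returns the text of the first <a> inside each <tr class="dir"> row.
    // (Hypothetical helper; assumes the wanted links live in rows with class "dir".)
    static List<String> firstLinkTexts(String html) {
        Document document = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element row : document.select("tr.dir")) {
            Element link = row.select("a").first(); // null when the row has no link
            if (link != null) {
                texts.add(link.text());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Shortened copy of the sample table from the question.
        String html = "<table><tbody>"
                + "<tr class=\"dir\"><td colspan=\"2\">&nbsp;&nbsp;<a href=\"http://xyz/\">Yogendra sharma</a></td></tr>"
                + "<tr><td class=\"f\"><a href=\"abc\">abc</a>&nbsp;</td></tr>"
                + "<tr class=\"dir\"><td colspan=\"2\">&nbsp;&nbsp;<a href=\"http://xyz/\">Yogendra</a></td></tr>"
                + "</tbody></table>";
        for (String text : firstLinkTexts(html)) {
            System.out.println(text);
        }
    }
}
```

To read from disk instead, Jsoup.parse(new File("sample.html"), "UTF-8") builds the Document directly from the file, so commons-io is not needed for this variant. The charset here is an assumption about how the file is encoded.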

