How to parse an HTML table using jsoup?

I am trying to parse HTML using jsoup. This is my first time working with jsoup, and I have read some tutorials on it. Below is the HTML table I am trying to parse.

The table below has three tr rows for now (I have shortened it to three rows for illustration; in practice there will be more). I would like to extract the Cluster Name and its corresponding host names. For example, I would extract Titan as the cluster name together with all of its host names whose status is down.

As you can see below, the Titan cluster has two host names, machineA.abc.com and machineB.abc.com. The status of machineA is up, but the status of machineB is down.

So I would print Titan as the cluster name and machineB.abc.com as the host name, since it is down. Is this possible with jsoup?

<table border=1>
   <tr>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>Alert</td>
      <td>Cluster Name</td>
      <td>IP addr</td>
      <td>Host Name</td>
      <td>Type</td>
      <td>Status</td>
      <td>Free</td>
      <td>Version</td>
      <td>Restart Time</td>
      <td>UpTime(Days)</td>
      <td>Last probed</td>
      <td>Last up</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Titan</td>
      <td>10.100.111.77</td>
      <td>machineA.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td></td>
      <td>10.200.192.99</td>
      <td>machineB.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">down</td>
      <td bgcolor="ffffff" align=right>85%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
      <td bgcolor="ffffff" align=right>103</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
</table>

So far, I am able to extract the whole HTML table using jsoup, but I am not sure how to extract the cluster name and the host names that are down:

URL url = new URL("url_name");
Document doc = Jsoup.parse(url, 3000);

Update:

The table might contain two cluster names, as shown below:

<table border=1>
   <tr>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>Alert</td>
      <td>Cluster Name</td>
      <td>IP addr</td>
      <td>Host Name</td>
      <td>Type</td>
      <td>Status</td>
      <td>Free</td>
      <td>Version</td>
      <td>Restart Time</td>
      <td>UpTime(Days)</td>
      <td>Last probed</td>
      <td>Last up</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Titan</td>
      <td>10.100.111.77</td>
      <td>machineA.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td></td>
      <td>10.200.192.99</td>
      <td>machineB.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">down</td>
      <td bgcolor="ffffff" align=right>85%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
      <td bgcolor="ffffff" align=right>103</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Goldy</td>
      <td>10.100.111.77</td>
      <td>machineH.pqr.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>       
</table>

As you can see above, there are now two cluster names, Titan and Goldy. I want to find all the machines that are down for the Titan cluster only.

Yes, it is possible with jsoup. First, select the table. Then select the <tr> tags for the rows. You can start from the second row (index 1), since the first row contains only the column names. Then, for each row, select its <td> cells and read the ones at the indexes you need. In your case, indexes 7 and 5 matter (index 7: Status, index 5: Host Name). If the status equals down, add the host name to a list. That's all.

ArrayList<String> downServers = new ArrayList<>();
Element table = doc.select("table").get(0); //select the first table.
Elements rows = table.select("tr");

for (int i = 1; i < rows.size(); i++) { //first row is the col names so skip it.
    Element row = rows.get(i);
    Elements cols = row.select("td");

    if (cols.get(7).text().equals("down")) {
        downServers.add(cols.get(5).text());
    }
}
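
The hard-coded positions 7 and 5 stop working if a column is ever added or reordered. One possible way around that is to look the positions up from the header row. The sketch below is stdlib-only and assumes the header cell texts have already been read into a list of strings (for example by calling text() on each td of the first row). The class name ColumnIndexDemo and the helper indexOfColumn are hypothetical names, not part of jsoup.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnIndexDemo {

    // Returns the position of the named column in the header row.
    // Throws if the column is missing, so a layout change fails loudly
    // instead of silently reading the wrong cell.
    public static int indexOfColumn(List<String> headers, String name) {
        int idx = headers.indexOf(name);
        if (idx < 0) {
            throw new IllegalArgumentException("No column named " + name);
        }
        return idx;
    }

    public static void main(String[] args) {
        // Header texts from the question's table. The first two cells hold
        // only a non-breaking space; they are written as empty strings here.
        List<String> headers = Arrays.asList("", "", "Alert", "Cluster Name", "IP addr",
                "Host Name", "Type", "Status", "Free", "Version", "Restart Time",
                "UpTime(Days)", "Last probed", "Last up");

        System.out.println(indexOfColumn(headers, "Host Name")); // prints 5
        System.out.println(indexOfColumn(headers, "Status"));    // prints 7
    }
}
```

The two indexes can then replace the literals in cols.get(7) and cols.get(5).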

Update: When you find a row whose cluster name is Titan, you can run an inner loop over the rows that follow for as long as their cluster name is empty.

Edit: I changed the while loop to a do-while loop.

ArrayList<String> downServers = new ArrayList<>();
Element table = doc.select("table").get(0); //select the first table.
Elements rows = table.select("tr");

for (int i = 1; i < rows.size(); i++) { //first row is the col names so skip it.
    Element row = rows.get(i);
    Elements cols = row.select("td");

    if (cols.get(3).text().equals("Titan")) {
        if (cols.get(7).text().equals("down"))
            downServers.add(cols.get(5).text());

        if (i == rows.size() - 1)
            break; //Titan is the last row, so no continuation rows follow.

        //walk the following rows while their cluster name is empty.
        do {
            i++;
            row = rows.get(i);
            cols = row.select("td");
            if (cols.get(7).text().equals("down") && cols.get(3).text().equals("")) {
                downServers.add(cols.get(5).text());
            }
        }
        while (cols.get(3).text().equals("") && i < rows.size() - 1);
        i--; //step back so the outer loop re-checks this row (it may be another Titan).
    }
}

The downServers ArrayList will contain the host names of the servers that are down.
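
As an alternative to the nested loop, the same result can probably be reached in a single pass by remembering the most recent non-empty cluster name and treating rows with an empty cluster cell as belonging to it. The sketch below is not the code from this answer. It is stdlib-only and assumes each data row has already been turned into a list of cell texts (for example via text() on each td), with the header row removed. The names DownHostsDemo, downHosts and row are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DownHostsDemo {

    // Column positions taken from the table in the question.
    static final int CLUSTER = 3, HOST = 5, STATUS = 7;

    // rows: one list of cell texts per data row (header row already removed).
    public static List<String> downHosts(List<List<String>> rows, String cluster) {
        List<String> result = new ArrayList<>();
        String current = "";              // cluster the current row belongs to
        for (List<String> cells : rows) {
            if (cells.size() <= STATUS) {
                continue;                 // short or malformed row, skip it
            }
            String name = cells.get(CLUSTER).trim();
            if (!name.isEmpty()) {
                current = name;           // a named row starts a new cluster
            }
            if (current.equals(cluster) && cells.get(STATUS).trim().equals("down")) {
                result.add(cells.get(HOST).trim());
            }
        }
        return result;
    }

    // Builds a 14-cell row with only the three relevant columns filled in.
    public static List<String> row(String cluster, String host, String status) {
        String[] cells = new String[14];
        Arrays.fill(cells, "");
        cells[CLUSTER] = cluster;
        cells[HOST] = host;
        cells[STATUS] = status;
        return Arrays.asList(cells);
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                row("Titan", "machineA.abc.com", "up"),
                row("", "machineB.abc.com", "down"),
                row("Goldy", "machineH.pqr.com", "up"));
        System.out.println(downHosts(rows, "Titan")); // prints [machineB.abc.com]
    }
}
```

Because the cluster name is carried forward row by row, this version needs no index bookkeeping and also handles a cluster that appears more than once in the table.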

What I would do in your case is first create a class for your machine with all the appropriate attributes. Then, using jsoup, I would extract the data into an ArrayList of those objects and apply whatever logic is needed to that list.

I am skipping the class definition (since it is not the issue here); I will call the class Machine.
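
For readers who want something concrete, a minimal sketch of such a class might look like the following. Only setClusterName, setIp, setStatus, getStatus and getHostName are actually called in this answer; the remaining accessors and the field layout are assumptions added for completeness.

```java
// Hypothetical value class; the fields mirror the table columns used in this answer.
public class Machine {
    private String clusterName;
    private String ip;
    private String hostName;
    private String status;

    public String getClusterName() { return clusterName; }
    public void setClusterName(String clusterName) { this.clusterName = clusterName; }

    public String getIp() { return ip; }
    public void setIp(String ip) { this.ip = ip; }

    public String getHostName() { return hostName; }
    public void setHostName(String hostName) { this.hostName = hostName; }

    public String getStatus() { return status; }
    public void setStatus(String status) { this.status = status; }

    public static void main(String[] args) {
        // Small smoke test using one row from the question's table.
        Machine m = new Machine();
        m.setClusterName("Titan");
        m.setHostName("machineA.abc.com");
        m.setStatus("up");
        System.out.println(m.getHostName() + " is " + m.getStatus());
    }
}
```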

Then, using jsoup, I would get the row data like this:

ArrayList<Machine> list = new ArrayList<>();
Document doc = Jsoup.parse(url, 3000);
for (Element table : doc.select("table")) { //this will work if your doc contains only one table element
  Elements rows = table.select("tr");
  for (int i = 1; i < rows.size(); i++) { //skip the header row
    Elements tds = rows.get(i).select("td");
    Machine tmp = new Machine();
    tmp.setClusterName(tds.get(3).text());
    tmp.setIp(tds.get(4).text());
    tmp.setHostName(tds.get(5).text()); //needed by getHostName() in the loop that follows
    tmp.setStatus(tds.get(7).text());
    //.... and so on for the rest of attributes
    list.add(tmp);
  }
}

Then use a loop to get the values you need from the list:

for(Machine x:list){
  if(x.getStatus().equalsIgnoreCase("up")){
    //machine with UP status found
    System.out.println("The Machine with up status is:"+x.getHostName());
  }
}

That's all. Please note that this code is untested and may contain syntax errors, as it was written directly in this editor and not in an IDE.

Below is a clean, generic function that extracts an HTML table into a simple list of maps, one map per row, keyed by column header.

Pass the document to this function along with tableOrder, the zero-based index of the table you want from the HTML page.

The function will not return accurate data if the table uses rowspan or colspan.

public static List<Map<String,String>> parseTable(Document doc, int tableOrder) {
    Element table = doc.select("table").get(tableOrder);
    Elements rows = table.select("tr");
    Elements first = rows.get(0).select("th,td");

    List<String> headers = new ArrayList<String>();
    for(Element header : first)
        headers.add(header.text());

    List<Map<String,String>> listMap = new ArrayList<Map<String,String>>();
    for(int row=1;row<rows.size();row++) {
        Elements colVals = rows.get(row).select("th,td");

        Map<String,String> tuple = new HashMap<String,String>();
        //cells beyond the last header are ignored so a long row cannot overrun the header list
        int colCount = Math.min(colVals.size(), headers.size());
        for(int col=0;col<colCount;col++)
            tuple.put(headers.get(col), colVals.get(col).text());
        listMap.add(tuple);
    }
    return listMap;
}
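
Once the table is in this list-of-maps form, picking out rows by column value becomes a plain collection operation. The sketch below is stdlib-only and builds the maps by hand instead of calling parseTable, so it runs without jsoup. The class FilterRowsDemo and the method hostsWithStatus are hypothetical names, and the keys "Status" and "Host Name" are the header texts from the question's table.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FilterRowsDemo {

    // rows: the structure returned by parseTable, one map per data row keyed by header text.
    public static List<String> hostsWithStatus(List<Map<String, String>> rows, String status) {
        return rows.stream()
                .filter(r -> status.equals(r.get("Status")))   // null-safe if the key is absent
                .map(r -> r.get("Host Name"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> a = new HashMap<>();
        a.put("Host Name", "machineA.abc.com");
        a.put("Status", "up");

        Map<String, String> b = new HashMap<>();
        b.put("Host Name", "machineB.abc.com");
        b.put("Status", "down");

        System.out.println(hostsWithStatus(Arrays.asList(a, b), "down")); // prints [machineB.abc.com]
    }
}
```

Note that this filter looks only at the Status column. Restricting the result to one cluster would additionally require carrying the cluster name forward across rows whose Cluster Name cell is empty.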
