
How to parse multiple HTML elements in jsoup?

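I am trying to parse a simple HTML table of crime statistics for a Garda station (the Garda are the Irish police) from an HTML document saved in my Java project. For now, I am just trying to parse the content of the HTML document and print it to the console. The problem is that I can only print the figures in the table (excluding the years), whereas what I want is the name of each crime from the table followed by its 6 figures.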

HTML table

<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Recorded Crime Offences (Number) by Garda Station, Type of Offence and&lt;BR&gt;
Year</title>
</head>
<body>
<table border="">
<tbody><tr align="LEFT">
<th colspan="8">Recorded Crime Offences (Number) by Garda Station, Type of Offence and<br>
Year</th>
</tr>
<tr align="LEFT">
<th colspan="2"> </th>
<th valign="TOP" colspan="1">2011</th>
<th valign="TOP" colspan="1">2012</th>
<th valign="TOP" colspan="1">2013</th>
<th valign="TOP" colspan="1">2014</th>
<th valign="TOP" colspan="1">2015</th>
<th valign="TOP" colspan="1">2016</th>
</tr>
<tr align="RIGHT">
<th align="LEFT" valign="TOP" rowspan="12">Balbriggan, D.M.R. Northern Division</th>
<th align="LEFT">03 ,Attempts/threats to murder, assaults, harassments and related offences</th>
<td>96</td>
<td>89</td>
<td>70</td>
<td>97</td>
<td>103</td>
<td>103</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">04 ,Dangerous or negligent acts</th>
<td>72</td>
<td>67</td>
<td>50</td>
<td>53</td>
<td>45</td>
<td>43</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">05 ,Kidnapping and related offences</th>
<td>0</td>
<td>0</td>
<td>1</td>
<td>3</td>
<td>0</td>
<td>7</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">06 ,Robbery, extortion and hijacking offences</th>
<td>16</td>
<td>19</td>
<td>16</td>
<td>7</td>
<td>11</td>
<td>13</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">07 ,Burglary and related offences</th>
<td>177</td>
<td>190</td>
<td>157</td>
<td>140</td>
<td>151</td>
<td>139</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">08 ,Theft and related offences</th>
<td>510</td>
<td>466</td>
<td>495</td>
<td>542</td>
<td>445</td>
<td>302</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">09 ,Fraud, deception and related offences</th>
<td>66</td>
<td>76</td>
<td>126</td>
<td>114</td>
<td>98</td>
<td>66</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">10 ,Controlled drug offences</th>
<td>113</td>
<td>100</td>
<td>64</td>
<td>55</td>
<td>44</td>
<td>80</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">11 ,Weapons and Explosives Offences</th>
<td>22</td>
<td>18</td>
<td>13</td>
<td>10</td>
<td>19</td>
<td>17</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">12 ,Damage to property and to the environment</th>
<td>257</td>
<td>266</td>
<td>269</td>
<td>203</td>
<td>213</td>
<td>177</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">13 ,Public order and other social code offences</th>
<td>168</td>
<td>115</td>
<td>93</td>
<td>78</td>
<td>79</td>
<td>92</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">15 ,Offences against government, justice procedures and organisation of crime</th>
<td>45</td>
<td>48</td>
<td>39</td>
<td>39</td>
<td>66</td>
<td>50</td>
</tr>
<tr align="LEFT">
<td colspan="8"><a href="http://www.cso.ie/en/methods/crime/recordedcrime/">See Background Notes</a> 
</td>
</tr>
</tbody></table>

</body></html>

The code I have come up with so far prints the figures like this:

Figure 0 : 96
Figure 1 : 89
Figure 2 : 70
Figure 3 : 97
Figure 4 : 103
Figure 5 : 103
Figure 6 : 72
Figure 7 : 67
Figure 8 : 50
Figure 9 : 53
Figure 10 : 45
... (Figures 11-66 omitted for conciseness)
Figure 67 : 48
Figure 68 : 39
Figure 69 : 39
Figure 70 : 66
Figure 71 : 50

But I would like the output to look more like this:

Crime: 03 ,Attempts/threats to murder, assaults, harassments and related offences
Figure 0 : 96
Figure 1 : 89
Figure 2 : 70
Figure 3 : 97
Figure 4 : 103
Figure 5 : 103

Crime: 04 ,Dangerous or negligent acts
Figure 6 : 72
Figure 7 : 67
Figure 8 : 50
Figure 9 : 53
Figure 10 : 45
etc, etc

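I have tried several different approaches, such as adding a for loop to access the th elements containing the crimes and then another loop to access the td elements containing the figures, but this usually results in something like: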

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0  

Working parser class

import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseCrimeStatistics {

    public static void main(String[] args) {
        try {
            int count = 0;
            File input = new File("Balbriggan.html");
            Document doc = Jsoup.parse(input, "UTF-8", "http://www.cso.ie");

            Elements title = doc.select("td");

            for (Element sectd1 : title) {
                Elements ths = sectd1.select("td");

                String result = ths.get(0).text();

                System.out.println("Figure " + count + " : " + result);

                count++;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Does anyone have any suggestions on how I could solve this? Thank you.

Try this:

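int count = 0;
File input = new File("Balbriggan.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.cso.ie");

Elements numbers = doc.select("td");
Elements titles = doc.select("th");

// The first 9 th elements are the table title, the blank corner cell,
// the 6 year headers and the station name, so crime names start at index 9.
for (int i = 9; i < titles.size(); i++) {
    System.out.println("Crime: " + titles.get(i).text());
    // Each crime row has 6 td figures, stored consecutively in numbers.
    for (int j = 0; j < 6; j++) {
        System.out.println("Figure " + count + " : " + numbers.get((i - 9) * 6 + j).text());
        count++;
    }
}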
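
The snippet above depends on the hard-coded offset 9 and on every row having exactly 6 figures, so it would need adjusting if the table layout changed. As an alternative, here is a rough sketch that walks the table row by row instead. It assumes, as in the posted HTML, that only the data rows contain both th and td cells and that the crime name is the last th in its row. The class name CrimeRows, the method format and the shortened sample table are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CrimeRows {

    // Builds the output lines row by row: one "Crime:" line per data row,
    // followed by one "Figure" line for each td in that row.
    static List<String> format(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        int count = 0;
        for (Element row : doc.select("tr")) {
            Elements headers = row.select("th");
            Elements cells = row.select("td");
            // Assumption: the title and year rows have no td, and the footer
            // row has no th, so only data rows pass this check.
            if (headers.isEmpty() || cells.isEmpty()) {
                continue;
            }
            // The first data row also holds the station name in a rowspan th,
            // so take the last th in the row as the crime name.
            lines.add("Crime: " + headers.last().text());
            for (Element cell : cells) {
                lines.add("Figure " + count + " : " + cell.text());
                count++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical stand-in for Balbriggan.html.
        String html = "<table>"
            + "<tr><th colspan=\"2\"> </th><th>2011</th><th>2012</th></tr>"
            + "<tr><th rowspan=\"2\">Balbriggan</th>"
            + "<th>04 ,Dangerous or negligent acts</th><td>72</td><td>67</td></tr>"
            + "<tr><th>05 ,Kidnapping and related offences</th><td>0</td><td>0</td></tr>"
            + "<tr><td colspan=\"4\">See Background Notes</td></tr>"
            + "</table>";
        format(html).forEach(System.out::println);
    }
}
```

To run it against the real file, the string passed to Jsoup.parse could be replaced by the same Jsoup.parse(input, "UTF-8", "http://www.cso.ie") call used above, with the IOException handled as in the question.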
