
HTML parsing in Java using Jsoup

I've been using Jsoup for HTML parsing, but I've run into a big problem: it takes far too long, about an hour.

Here is a sample of the HTML that I am parsing.

<tr>
    <td class="class1">value1 </td>
    <td class="class1">value2</td>
    <td class="class1">value3</td>
    <td class="class1">value4</td>
    <td class="class1">value5 </td>
    <td class="class1">value6</td>
    <td class="class1">value7</td>
    <td class="class1">value8</td>
    <td class="class1">value9</td>
</tr>

The site contains thousands of table rows like this one, and I need to parse them all into a list. I only need value1 and value6, so I am using this code.

Document doc = Jsoup.connect(url).get();
ls = new LinkedList();
for (int i = 15; i < doc.text().length(); i++) { // 15 because the rows I want start at index 15
    Element element = doc.getElementsByTag("tr").get(i); // row index
    Elements row = element.getElementsByTag("td");
    value6 = row.get(5).text(); // getting value6
    value1 = row.get(0).text(); // getting value1
    node = new Node(value1, value6);
    ls.insert(node);
}

As I said, it takes too much time, so I need to make it faster. Any ideas on how to fix this problem?

I think your problem stems from the for loop for(int i = 15; i<doc.text().length(); i++). Its upper bound is the number of characters in the document's text, not the number of table rows, and doc.text() is rebuilt every time the condition is checked. On top of that, each iteration calls doc.getElementsByTag("tr") again, which searches the whole document every time. I highly doubt that this is what you want to do. I think you want to loop over the table rows instead, selecting them once before the loop. Something like this should work:

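import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect(url).get();
Elements trs = doc.select("tr"); // select the rows once, outside the loop
for (int i = 15; i < trs.size(); i++) { // the rows of interest start at index 15
  Element tr = trs.get(i);
  Elements tds = tr.select("td");
  if (tds.size() < 6) continue; // skip rows that do not have a sixth cell
  String value1 = tds.get(0).text(); // getting value1 (first cell)
  String value6 = tds.get(5).text(); // getting value6 (sixth cell)
  // do whatever you need to do with the values
}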
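To try the same idea without hitting the network, here is a minimal sketch that parses an inline HTML string with Jsoup.parse instead of Jsoup.connect. The class name RowExtractor, the method extract, and the sample markup are made up for illustration, and a plain String[] pair stands in for the Node class from the question, whose definition is not shown.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical helper, not part of Jsoup or of the original question.
public class RowExtractor {

    // Returns {value1, value6} for every row at index firstRow or later
    // that has at least six cells.
    static List<String[]> extract(String html, int firstRow) {
        Document doc = Jsoup.parse(html);
        Elements rows = doc.select("tr"); // one query for all rows
        List<String[]> pairs = new ArrayList<>();
        for (int i = firstRow; i < rows.size(); i++) {
            Elements cells = rows.get(i).select("td");
            if (cells.size() < 6) {
                continue; // header or malformed row: nothing to read
            }
            pairs.add(new String[] { cells.get(0).text(), cells.get(5).text() });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Made-up sample data shaped like the rows in the question.
        String html = "<table>"
                + "<tr><td>a1</td><td>a2</td><td>a3</td><td>a4</td><td>a5</td><td>a6</td></tr>"
                + "<tr><td>too short</td></tr>"
                + "<tr><td>b1</td><td>b2</td><td>b3</td><td>b4</td><td>b5</td><td>b6</td></tr>"
                + "</table>";
        for (String[] pair : extract(html, 0)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

For the real page you would pass the downloaded HTML and 15 as firstRow. Collecting into an ArrayList keeps the work linear in the number of rows, since the document is searched only once.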
