Website Scraping with Java and jsoup; HTML cannot be read; NullPointerException

I am trying to scrape data from a website using Java and jsoup. The main aim of my program is to read data out of a table. Unfortunately, the code works only for a simple example table, but not for others like the one in the code.

 import org.jsoup.*;
 import org.jsoup.helper.*;
 import org.jsoup.nodes.*;
 import org.jsoup.select.*;
 import java.io.*; // Only needed if scraping a local File.
 import java.util.*;

 public class Test1 {
    public static void main(String args[]) throws IOException { 
        try{

            Document doc = Jsoup.connect("http://www.truckit.net/freight/details/index/id/62674").timeout(10*1000).get();
            String title = doc.title();

            Element table = doc.getElementById("table");
            Elements rows = table.getElementsByTag("tr");

            for (Element row : rows) {
                Elements tds = row.getElementsByTag("td");
                for (int i = 0; i < tds.size(); i++) {
                    if (i == 1) System.out.println(tds.get(i).text());
                }
            }                           
        }
        catch (java.io.IOException ex) {
            System.out.println("IO Error: " + ex);
            }
    }       
}

The console output is as follows:

Exception in thread "main" java.lang.NullPointerException
at Test1.main(Test1.java:30)

I read a number of threads about NullPointerExceptions, but they did not really help me. I know that the variable table is null, and thus the tr lookup on it fails too, but why is that? Since my program works for other websites, might my problem have to do with the website's HTML code?

This is because the page in the link does not have an element whose id attribute is set to "table".

This means you'll have to find a different hook for jsoup to latch onto the data.
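To see why the lookup yields null, here is a minimal sketch that parses an inline HTML string instead of the real page. The class name NullCheckSketch, the helper hasElementWithId, and the sample markup are made up for illustration; the only jsoup behaviour relied on is that getElementById returns null when no element carries the requested id.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NullCheckSketch {
    // Hypothetical helper: reports whether any element in the HTML has the given id.
    static boolean hasElementWithId(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element element = doc.getElementById(id); // null when no element has this id
        return element != null;
    }

    public static void main(String[] args) {
        // Made-up markup: the first table has no id, the second has id="table".
        String withoutId = "<table><tr><td>a</td><td>b</td></tr></table>";
        String withId = "<table id=\"table\"><tr><td>a</td><td>b</td></tr></table>";

        System.out.println(hasElementWithId(withoutId, "table")); // false
        System.out.println(hasElementWithId(withId, "table"));    // true
    }
}
```

Checking the result for null before calling getElementsByTag on it would have turned the exception into a readable "table not found" message.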

Tables will rarely have the id="table" attribute set, since it's redundant.

Thus you are generally better off using

Elements tables = doc.getElementsByTag("table");

instead of:

Element table = doc.getElementById("table");

Especially since the page might have multiple tables (as is the case on the website you mentioned).
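As a rough sketch of the tag-based approach, the snippet below walks every table in a document and collects the second cell of each row, mirroring the loop in the question. The class name TableTagSketch, the helper secondCells, and the sample HTML are assumptions for illustration, not the markup of the real page. Note that getElementsByTag returns an empty Elements collection rather than null when nothing matches, so the loop simply does not run.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TableTagSketch {
    // Hypothetical helper: text of the second <td> of every row in every table.
    static List<String> secondCells(String html) {
        Document doc = Jsoup.parse(html);
        Elements tables = doc.getElementsByTag("table"); // empty if the page has no tables
        List<String> values = new ArrayList<>();
        for (Element table : tables) {
            for (Element row : table.getElementsByTag("tr")) {
                Elements tds = row.getElementsByTag("td");
                if (tds.size() > 1) {          // skip rows with fewer than two cells
                    values.add(tds.get(1).text());
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up page with two separate tables.
        String html = "<table><tr><td>From</td><td>Sydney</td></tr></table>"
                    + "<table><tr><td>To</td><td>Perth</td></tr></table>";
        System.out.println(secondCells(html)); // [Sydney, Perth]
    }
}
```

If the page nests tables inside each other, rows of the inner table would be visited more than once with this loop, so a real scraper would probably need to narrow the selection.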

Also note that scraping is a case-by-case kind of deal: each scraper has to be tailor-made for a particular website or page, so there is no one-size-fits-all scraper that will work everywhere.

Before attempting to scrape data, you should examine the structure of the page you want to scrape (through the view page source option), then decide which data you want and what the easiest pathway through the DOM is to get it.
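Once you know the structure, a CSS selector is often the shortest pathway. The sketch below assumes, purely for illustration, that the interesting table sits inside a div with class "details" and that each row holds a label cell followed by a value cell; the real page will differ, so the selector string, the class name SelectorSketch, and the helper valueFor are all placeholders to adapt.

```java
import java.util.Optional;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorSketch {
    // Hypothetical helper: finds the value cell next to a given label cell.
    static Optional<String> valueFor(String html, String label) {
        Document doc = Jsoup.parse(html);
        // Assumed structure: <div class="details"> containing a table of label/value rows.
        for (Element row : doc.select("div.details table tr")) {
            Elements cells = row.select("td");
            if (cells.size() >= 2 && cells.get(0).text().equals(label)) {
                return Optional.of(cells.get(1).text());
            }
        }
        return Optional.empty(); // label not present on this page
    }

    public static void main(String[] args) {
        // Made-up markup matching the assumed structure.
        String html = "<div class=\"details\"><table>"
                    + "<tr><td>Pickup</td><td>Sydney</td></tr>"
                    + "<tr><td>Delivery</td><td>Perth</td></tr>"
                    + "</table></div>";
        System.out.println(valueFor(html, "Delivery").orElse("not found")); // Perth
        System.out.println(valueFor(html, "Weight").orElse("not found"));   // not found
    }
}
```

Returning an Optional instead of null keeps a missing label from turning into another NullPointerException further down.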
