InputStreamReader on a URL connection returning null

I am following a tutorial on web scraping from the book "Web Scraping with Java". The following code throws a NullPointerException. Part of the problem is that (line = in.readLine()) is always null, so the while loop in getUrl never runs. I do not know why it is always null, however. Can anyone offer some insight into this? This code should print the first paragraph of the Wikipedia article on CPython.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.net.*;
import java.io.*;

public class WikiScraper {
    public static void main(String[] args) {
        scrapeTopic("/wiki/CPython");
    }
    public static void scrapeTopic(String url){
        String html = getUrl("http://www.wikipedia.org/"+url);
        Document doc = Jsoup.parse(html);
        String contentText = doc.select("#mw-content-text > p").first().text();
        System.out.println(contentText);
    }
    public static String getUrl(String url){
        URL urlObj = null;
        try{
            urlObj = new URL(url);
        }
        catch(MalformedURLException e){
            System.out.println("The url was malformed!");
            return "";
        }
        URLConnection urlCon = null;
        BufferedReader in = null;
        String outputText = "";
        try{
            urlCon = urlObj.openConnection();
            in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
            String line = "";
            while((line = in.readLine()) != null){
                outputText += line;
            }
            in.close();
        }catch(IOException e){
            System.out.println("There was an error connecting to the URL");
            return "";
        }
        return outputText;
    }
}

If you enter http://www.wikipedia.org//wiki/CPython in a web browser, it is redirected to https://en.wikipedia.org/wiki/CPython. A browser follows that redirect, but HttpURLConnection does not follow a redirect that switches protocol from http to https, so the program receives the redirect response itself, whose body is apparently empty here. That is why in.readLine() returns null on the first call. getUrl then returns an empty string, Jsoup finds no matching paragraph in it, and first() returns null, which is the likely source of the NullPointerException.

So use

String html = getUrl("https://en.wikipedia.org/"+url);

instead of

String html = getUrl("http://www.wikipedia.org/"+url);

Then line = in.readLine() actually reads something.
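To make this kind of failure visible rather than ending up with an empty string, it may help to check the HTTP status before reading the body. The following is only a minimal sketch, not the book's code: the class name WikiFetch and the helpers buildArticleUrl and fetch are hypothetical names introduced for illustration, and the sketch assumes the article is served from https://en.wikipedia.org/ as described above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class WikiFetch {
    // Hypothetical helper: joins the host and an article path without producing "//".
    static String buildArticleUrl(String path) {
        String trimmed = path.startsWith("/") ? path.substring(1) : path;
        return "https://en.wikipedia.org/" + trimmed;
    }

    // Hypothetical helper: returns the response body, or throws if the status is not 200.
    static String fetch(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        int status = con.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            // An unfollowed redirect (301/302) ends up here instead of yielding "".
            throw new IOException("Unexpected HTTP status " + status + " for " + url);
        }
        StringBuilder out = new StringBuilder();
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String url = buildArticleUrl("/wiki/CPython");
        try {
            String html = fetch(url);
            System.out.println("Fetched " + html.length() + " characters from " + url);
        } catch (IOException e) {
            System.out.println("Could not fetch " + url + ": " + e.getMessage());
        }
    }
}
```

With the original http://www.wikipedia.org/ address, a check like this should report the redirect status instead of silently returning nothing. In scrapeTopic it would still be worth testing the result of first() for null before calling text().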
