Using Java to pull data from a webpage?

Question

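I'm attempting to make my first program in Java. The goal is to write a program that browses to a website and downloads a file for me. However, I don't know how to use Java to interact with the internet. Can anyone tell me what topics to look up or read about, or recommend some good resources?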

Answer 1

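The simplest solution (without depending on any third-party library or platform) is to create a URL instance pointing to the web page or link you want to download, and read the content using streams.

For example: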

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;


public class DownloadPage {

    public static void main(String[] args) throws IOException {

        // Make a URL to the web page
        URL url = new URL("http://stackoverflow.com/questions/6159118/using-java-to-pull-data-from-a-webpage");

        // Get the input stream through URL Connection
        URLConnection con = url.openConnection();
        InputStream is = con.getInputStream();

        // Once you have the Input Stream, it's just plain old Java IO stuff.

        // For this case, since you are interested in getting plain-text web page
        // I'll use a reader and output the text content to System.out.

        // For binary content, it's better to directly read the bytes from stream and write
        // to the target file.


        // try-with-resources closes the reader (and the underlying stream) when done
        try (BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
            String line;

            // read each line and write to System.out
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
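
The comments in the code above note that binary content is better copied as raw bytes rather than read line by line. A minimal sketch of that idea, assuming Java 7 or later; the class name DownloadFile, the method download, and the command-line arguments are illustrative and not part of the original answer:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class, named here only for illustration
public class DownloadFile {

    /** Copies the raw bytes behind source into target and returns the number of bytes written. */
    public static long download(URL source, Path target) throws IOException {
        // try-with-resources closes the stream even if the copy fails
        try (InputStream in = source.openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java DownloadFile <url> <target-file>");
            return;
        }
        long bytes = download(new URL(args[0]), Paths.get(args[1]));
        System.out.println("Saved " + bytes + " bytes to " + args[1]);
    }
}
```

Because no Reader is involved, the bytes are never decoded as text, so images, archives, and PDFs should arrive unchanged.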

Hope this helps.

Answer 2

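The basics

Look at these to build a solution more or less from scratch:

Start from the basics: the networking chapter of the Java Tutorial, including Working With URLs
Make things easier for yourself: Apache HttpComponents (including HttpClient)

The easily glued and stitched stuff

You always have the option of calling external tools from Java using exec() and similar methods. For instance, you could use wget or cURL.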
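
As a rough sketch of the external-tool approach, the following uses ProcessBuilder (the more flexible sibling of Runtime.exec()) to run curl. It assumes a curl binary is available on the PATH; the class name CurlDownload and its two methods are made up for this illustration:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper class; assumes a curl executable is installed and on the PATH
public class CurlDownload {

    /** Builds the curl command line: fail on HTTP errors, follow redirects, write to outputFile. */
    public static List<String> buildCommand(String url, String outputFile) {
        return Arrays.asList("curl", "--fail", "--location", "--silent", "--show-error",
                "--output", outputFile, url);
    }

    /** Runs curl and returns its exit code (0 means success). */
    public static int run(String url, String outputFile) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(url, outputFile))
                .inheritIO() // let curl's own error messages appear in this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 2) {
            System.err.println("usage: java CurlDownload <url> <target-file>");
            return;
        }
        System.out.println("curl exited with code " + run(args[0], args[1]));
    }
}
```

Passing the arguments as a list, rather than as one concatenated string, avoids shell quoting problems when a URL contains characters such as & or spaces.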

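The hardcore stuff

Then, if you want to go for something more fully fledged, thankfully the need for automated web testing has given us very practical tools for this. Look at:

HtmlUnit (powerful and simple)
Selenium, Selenium-RC
WebDriver/Selenium2 (still in the works)
JBehave with JBehave Web

Some other libraries are purposefully written with web scraping in mind:

jsoup
Jaunt

Some workarounds

Java is a language, but also a platform, with many other languages running on it. Some of them integrate great syntactic sugar or libraries to easily build scrapers.

Check out:

Groovy (and its XmlSlurper)
or Scala (with great XML support)

If you know Ruby (JRuby, with an article on scraping with JRuby and HtmlUnit) or Python (Jython), or you simply prefer these languages, then give them a chance.

Some supplements

Some other similar questions:

Scrape data from HTML using Java
Options for HTML scraping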
Answer 3

Here is my solution using URL and a try-with-resources statement, with catch clauses for the exceptions.

/**
 * Created by mona on 5/27/16.
 */
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
public class ReadFromWeb {
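    public static void readFromWeb(String webURL) throws IOException {
        // try-with-resources closes the reader and the underlying stream automatically;
        // creating the URL inside it lets the catch clauses below see a malformed address
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new URL(webURL).openStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
        catch (MalformedURLException e) {
            // Rethrow with a clearer message, keeping the original exception as the cause
            MalformedURLException wrapped = new MalformedURLException("URL is malformed: " + webURL);
            wrapped.initCause(e);
            throw wrapped;
        }
        catch (IOException e) {
            throw new IOException("Failed to read from " + webURL, e);
        }
    }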
    public static void main(String[] args) throws IOException {
        String url = "https://madison.craigslist.org/search/sub";
        readFromWeb(url);
    }

}

You can additionally save this to a file, depending on your requirements, or parse it using XML or HTML libraries.
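
To illustrate the parsing step, here is a deliberately crude sketch that pulls href values out of already fetched page text with java.util.regex. It assumes attributes written with double quotes; real pages are messier, so a proper HTML parsing library is the more robust choice. The class name LinkExtractor and the sample markup are invented for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a regex is only a rough stand-in for a real HTML parser
public class LinkExtractor {

    // Matches href="..." with double quotes only (a simplifying assumption)
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns every double-quoted href value found in the given text, in document order. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up markup standing in for the text printed by readFromWeb
        String html = "<a href=\"report.pdf\">Report</a> "
                + "<A HREF=\"https://example.com/data.csv\">Data</A>";
        System.out.println(extractLinks(html));
    }
}
```

Relative links such as report.pdf would still need to be resolved against the page address before they can be downloaded.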

Answer 4

Since Java 11, the most convenient way is to use java.net.http.HttpClient from the standard library.

Example:

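import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
import java.time.Duration;

public class HttpClientExample {
    public static void main(String[] args) throws IOException, InterruptedException {
        // URI.create avoids the checked URISyntaxException thrown by new URI(...)
        HttpRequest request = HttpRequest.newBuilder(URI.create(
            "https://stackoverflow.com/questions/6159118/using-java-to-pull-data-from-a-webpage"))
          .timeout(Duration.ofSeconds(10))
          .GET()
          .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
          .send(request, BodyHandlers.ofString());

        if (response.statusCode() != 200) {
          throw new RuntimeException(
            "Invalid response: " + response.statusCode() + ", request: " + response);
        }

        System.out.println(response.body());
    }
}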
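
Since the question is about downloading files, the same client can also stream a response body straight to disk instead of into a String. The following is a sketch under assumptions: the class name HttpClientDownload, the method downloadTo, and the timeout values are arbitrary choices for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Duration;

// Hypothetical helper class; requires Java 11 or later
public class HttpClientDownload {

    /** Saves the response body for uri into target and returns the path that was written. */
    public static Path downloadTo(URI uri, Path target) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL) // the default client does not follow redirects
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
        // TRUNCATE_EXISTING ensures an older, longer file does not leave stale bytes behind
        HttpResponse<Path> response = client.send(request, BodyHandlers.ofFile(target,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING));
        if (response.statusCode() != 200) {
            // Note: the error body has already been written to target at this point
            throw new IOException("Unexpected status " + response.statusCode() + " for " + uri);
        }
        return response.body();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 2) {
            System.err.println("usage: java HttpClientDownload <url> <target-file>");
            return;
        }
        System.out.println("Saved to " + downloadTo(URI.create(args[0]), Path.of(args[1])));
    }
}
```

A longer-lived program would normally build one HttpClient and reuse it for every request; it is created inside the method here only to keep the sketch short.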

Answer 5

I use the following code for my API:

// try-with-resources closes the stream when reading finishes or fails
try (InputStream content = new URL("https://stackoverflow.com/questions/6159118/using-java-to-pull-data-from-a-webpage").openStream()) {
    int c;
    // Reads one byte at a time; the cast to char is only correct for single-byte characters
    while ((c = content.read()) != -1) System.out.print((char) c);
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException ie) {
    ie.printStackTrace();
}

You can capture the characters and convert them into a string.
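
One possible way to do that conversion is to drain the stream first and decode it once, which avoids breaking multi-byte characters. This sketch assumes Java 9 or later (for readAllBytes) and assumes the content is UTF-8, which a real page may not be; the class name StreamToString is made up:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; UTF-8 is an assumption about the content's encoding
public class StreamToString {

    /** Reads the whole stream and decodes it as UTF-8 text. */
    public static String readAll(InputStream in) throws IOException {
        // Decoding all bytes at once keeps multi-byte characters intact
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for url.openStream() so the sketch runs offline
        byte[] sample = "caf\u00e9 and \u2713".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(sample)) {
            System.out.println(readAll(in));
        }
    }
}
```

Reading everything into memory is fine for ordinary web pages; for very large downloads, copying the stream to a file is the safer route.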

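Using Java to pull data from a webpage?

Problem description

5 solutions

Solution 1
43 2011-05-28 05:22:29

Solution 2
29 2011-05-28 01:35:29

The basics

The easily glued and stitched stuff

The hardcore stuff

Some workarounds

Some supplements

Solution 3
6 2016-05-28 05:29:53

Solution 4
0 2021-02-18 09:01:23

Solution 5
0 2021-04-04 04:14:37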
