How do you programmatically download a webpage in Java?

Question

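I would like to be able to fetch a web page's HTML and save it to a String, so that I can do some processing on it. Also, how could I handle various types of compression?

How would I go about doing that using Java?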

Answer 1

I'd use a decent HTML parser like Jsoup. It's then as easy as:

String html = Jsoup.connect("http://stackoverflow.com").get().html();

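It handles GZIP and chunked responses and character encoding fully transparently. It offers more advantages as well, such as HTML traversal and manipulation by CSS selectors, much like jQuery. You only have to grab the page as a Document rather than as a String.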

Document document = Jsoup.connect("http://google.com").get();

You really don't want to run basic String methods, or even regular expressions, on HTML to process it.

See also:

What are the pros and cons of the leading HTML parsers in Java?

Answer 2

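Here's some tested code using Java's URL class. I'd recommend doing a better job than I do here of handling the exceptions or passing them up the call stack, though.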

public static void main(String[] args) {
    URL url;
    InputStream is = null;
    BufferedReader br;
    String line;

    try {
        url = new URL("http://stackoverflow.com/");
        is = url.openStream();  // throws an IOException
        br = new BufferedReader(new InputStreamReader(is));

        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    } catch (MalformedURLException mue) {
         mue.printStackTrace();
    } catch (IOException ioe) {
         ioe.printStackTrace();
    } finally {
        try {
            if (is != null) is.close();
        } catch (IOException ioe) {
            // nothing to see here
        }
    }
}
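
The question asks for the page in a String rather than printed line by line. A minimal sketch of that variant follows, assuming the caller knows the page's charset; the class and method names (PageReader, readAll, download) are made up for this illustration. It uses try-with-resources so the stream is closed without the explicit finally block.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PageReader {

    // Drains the stream into memory and decodes the bytes with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    // Opens the URL and returns the body as a String; the stream is closed automatically.
    static String download(String address, Charset charset) throws IOException {
        try (InputStream in = new URL(address).openStream()) {
            return readAll(in, charset);
        }
    }

    // Pass a URL as the only argument; UTF-8 is assumed here for simplicity.
    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(download(args[0], StandardCharsets.UTF_8));
        }
    }
}
```

Decoding the bytes once at the end, instead of line by line, also keeps the original line breaks intact.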

Answer 3

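Bill's answer is very good, but you may want to do some things with the request, such as asking for compression or setting a user agent. The following code shows how to request several types of compression and decode the response accordingly.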

URL url = new URL(urlStr);
HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail
HttpURLConnection.setFollowRedirects(true);
// allow both GZip and Deflate (ZLib) encodings
conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
String encoding = conn.getContentEncoding();
InputStream inStr = null;

// create the appropriate stream wrapper based on
// the encoding type
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    inStr = new GZIPInputStream(conn.getInputStream());
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    inStr = new InflaterInputStream(conn.getInputStream(),
      new Inflater(true));
} else {
    inStr = conn.getInputStream();
}

To also set the user agent, add the following code:

conn.setRequestProperty ( "User-agent", "my agent name");
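
The stream-wrapping branch above can be pulled into a small helper and tried without any network access. The sketch below is one way to do that; ContentDecoder and decode are hypothetical names. One assumption differs from the code above: it treats "deflate" as zlib-wrapped data, which is how the HTTP specification defines that coding, whereas the code above uses new Inflater(true) for servers that send raw deflate.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

class ContentDecoder {

    // Wraps the raw body stream according to the Content-Encoding response header.
    // Assumption: "deflate" bodies are zlib-wrapped; a server that sends raw deflate
    // needs new InflaterInputStream(raw, new Inflater(true)) instead.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(raw);
        }
        if ("deflate".equalsIgnoreCase(contentEncoding)) {
            return new InflaterInputStream(raw);
        }
        return raw; // null or unrecognized encoding: pass the bytes through unchanged
    }

    // Local round trip: gzip a small page in memory, then decode it again.
    public static void main(String[] args) throws IOException {
        byte[] page = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(page);
        }
        try (InputStream in = decode(new ByteArrayInputStream(compressed.toByteArray()), "gzip")) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

With a real connection the call would look like decode(conn.getInputStream(), conn.getContentEncoding()).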

Answer 4

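Well, you could go with the built-in libraries such as URL and URLConnection, but they don't give you very much control.

~~Personally I'd go with the Apache HTTPClient library.~~
Edit: HTTPClient has been set to end of life by Apache. The replacement is: HTTP Components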
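
For readers who want more control than URLConnection without adding a dependency, here is a rough sketch using java.net.http.HttpClient, which ships with the JDK since Java 11. This is a different client from the Apache libraries named above, not a drop-in replacement for them. The class name, the 10 second timeout, and the agent string are placeholders chosen for this sketch.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class JdkHttpClientExample {

    // Builds a GET request with an explicit User-Agent header and a request timeout.
    static HttpRequest buildRequest(String address) {
        return HttpRequest.newBuilder(URI.create(address))
                .header("User-Agent", "my agent name")
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    // Sends the request, following ordinary redirects, and returns the body as text.
    static String fetch(String address) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(address), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Pass a URL as the only argument to print the downloaded page.
    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            System.out.println(fetch(args[0]));
        } else {
            System.out.println("usage: java JdkHttpClientExample.java <url>");
        }
    }
}
```

This client does not decompress responses on its own, which is why the sketch sends no Accept-Encoding header.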

Answer 5

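None of the approaches mentioned above download the web page text as it appears in the browser. These days a lot of data is loaded into the browser by scripts embedded in the HTML page. None of the techniques above run scripts; they download only the HTML text. HtmlUnit supports JavaScript, so if you want the web page text as it looks in the browser, you should use HtmlUnit.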

Answer 6

You will most likely need to extract code from a secure web page (https protocol). In the following example, the HTML file is saved to c:\temp\filename.html. Enjoy!

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

import javax.net.ssl.HttpsURLConnection;

/**
 * <b>Get the Html source from the secure url </b>
 */
public class HttpsClientUtil {
    public static void main(String[] args) throws Exception {
        String httpsURL = "https://stackoverflow.com";
        String FILENAME = "c:\\temp\\filename.html";
        BufferedWriter bw = new BufferedWriter(new FileWriter(FILENAME));
        URL myurl = new URL(httpsURL);
        HttpsURLConnection con = (HttpsURLConnection) myurl.openConnection();
        con.setRequestProperty ( "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0" );
        InputStream ins = con.getInputStream();
        InputStreamReader isr = new InputStreamReader(ins, "Windows-1252");
        BufferedReader in = new BufferedReader(isr);
        String inputLine;

        // Write each line into the file, restoring the line break that readLine() strips
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
            bw.write(inputLine);
            bw.newLine();
        }
        in.close(); 
        bw.close();
    }
}
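
The example above hard-codes Windows-1252, which garbles pages served in another encoding. A possible refinement is to read the charset from the Content-Type response header. The sketch below shows one way to parse it; CharsetSniffer and charsetFrom are hypothetical names, and the UTF-8 fallback is an assumption of this sketch, not something the original answer specifies.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSniffer {

    // Extracts the charset parameter from a Content-Type header value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption of this
    // sketch) when the header or parameter is missing or names an unknown charset.
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFrom("text/html"));
    }
}
```

In the class above this would replace the literal charset name: new InputStreamReader(ins, CharsetSniffer.charsetFrom(con.getContentType())). Pages that declare their encoding only in a meta tag are not covered by this header check.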

Answer 7

For this, use NIO.2's powerful Files.copy(InputStream in, Path target):

URL url = new URL( "http://download.me/" );
Files.copy( url.openStream(), Paths.get("downloaded.html" ) );
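
Two details of the two-liner are worth spelling out: Files.copy throws FileAlreadyExistsException when the target exists unless REPLACE_EXISTING is passed, and it does not close the input stream. The sketch below handles both; SaveToFile and save are hypothetical names, and an in-memory stream stands in for url.openStream() so the example runs offline.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SaveToFile {

    // Copies the whole stream to target, replacing any existing file, and
    // returns the number of bytes written. The caller closes the stream.
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Offline demonstration: the byte array plays the role of the downloaded page.
    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("downloaded", ".html");
        byte[] page = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(page)) {
            System.out.println(save(in, target) + " bytes written to " + target);
        }
    }
}
```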

Answer 8

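On a Unix/Linux box you could just run 'wget', but that is not really an option if you're writing a cross-platform client. Of course, this assumes that you don't want to do much with the data between downloading it and it hitting the disk.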

Answer 9

Jetty has an HTTP client which can be used to download a web page.

package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {

            client = new HttpClient();
            client.start();
            
            String url = "http://example.com";

            ContentResponse res = client.GET(url);

            System.out.println(res.getContentAsString());

        } finally {

            if (client != null) {

                client.stop();
            }
        }
    }
}

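The example prints the contents of a simple web page.

In the Reading a web page in Java tutorial, I have written six examples of downloading a web page programmatically in Java using URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.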

Answer 10

Take help from this class; it fetches the page source and filters out some of the information.

public class MainActivity extends AppCompatActivity {

    EditText url;
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate( savedInstanceState );
        setContentView( R.layout.activity_main );

        url = ((EditText)findViewById( R.id.editText));
        DownloadCode obj = new DownloadCode();

        try {
            String des=" ";

            String tag1= "<div class=\"description\">";
            String l = obj.execute( "http://www.nu.edu.pk/Campus/Chiniot-Faisalabad/Faculty" ).get();

            // Take the text that follows the opening tag, up to the next closing </div>
            String[] t1 = l.split(tag1);
            String[] t2 = t1[1].split( "</div>" );
            url.setText( t2[0] );

        }
        catch (Exception e)
        {
            Toast.makeText( this,e.toString(),Toast.LENGTH_SHORT ).show();
        }

    }
                                        // input, extrafunctionrunparallel, output
    class DownloadCode extends AsyncTask<String,Void,String>
    {
        @Override
        protected String doInBackground(String... WebAddress) // string of webAddress separate by ','
        {
            String htmlcontent = " ";
            try {
                URL url = new URL( WebAddress[0] );
                HttpURLConnection c = (HttpURLConnection) url.openConnection();
                c.connect();
                InputStream input = c.getInputStream();
                int data;
                InputStreamReader reader = new InputStreamReader( input );

                data = reader.read();

                while (data != -1)
                {
                    char content = (char) data;
                    htmlcontent+=content;
                    data = reader.read();
                }
            }
            catch (Exception e)
            {
                Log.i("Status : ",e.toString());
            }
            return htmlcontent;
        }
    }
}
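
The filtering step in onCreate can be written without split, which makes the missing-tag case explicit. Below is a small sketch in plain Java with no Android classes; HtmlSnippet and extractBetween are hypothetical names, and the sample HTML string is invented for the demonstration.

```java
class HtmlSnippet {

    // Returns the text between the first occurrence of startMarker and the next
    // endMarker after it, or null when either marker is missing.
    static String extractBetween(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return null;
        }
        start += startMarker.length();
        int end = html.indexOf(endMarker, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        String html = "<body><div class=\"description\">Faculty list</div></body>";
        System.out.println(extractBetween(html, "<div class=\"description\">", "</div>"));
    }
}
```

This stops at the first closing tag, so nested div elements would be cut short; an HTML parser such as the Jsoup approach in Answer 1 is the more robust route for anything beyond a simple page.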

Answer 11

I used the accepted answer to this post (the URL approach) and wrote the output to a file.

package test;

import java.net.*;
import java.io.*;

public class PDFTest {
    public static void main(String[] args) throws Exception {
        URL oracle = new URL("http://www.fetagracollege.org");
        String fileName = "D:\\a_01\\output.txt";

        // try-with-resources closes both streams, which also flushes the writer
        try (BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
             PrintWriter writer = new PrintWriter(fileName, "UTF-8")) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
                writer.println(inputLine);
            }
        } catch (IOException e) {
            e.printStackTrace(); // do not silently swallow download or write failures
        }
    }
}

11 answers

Answer 1: 178, 2010-12-31 17:57:30
Answer 2: 112 (accepted), 2008-10-26 21:09:39
Answer 3: 26, 2010-04-06 05:17:00
Answer 4: 13, 2008-10-26 20:20:45
Answer 5: 9, 2014-05-30 10:30:16
Answer 6: 2, 2018-10-27 17:55:52
Answer 7: 1, 2020-06-15 19:23:00
Answer 8: 0, 2008-10-26 20:43:45
Answer 9: 0, 2016-08-18 16:42:58
Answer 10: 0, 2017-12-16 17:23:19
Answer 11: -1, 2017-10-26 08:42:30
