How to get a web page's title without downloading the entire page source

Question

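I'm looking for a way to get the title of a web page and store it as a string.

However, every solution I have found so far involves downloading the page's source code, which isn't practical for a large number of web pages.

The only approach I can see is to limit the length of the string, or to download only a certain number of characters, or to stop once the download reaches the tag, but that would presumably still be quite large?

Thanks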

Answer 1

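Because the <title> tag is part of the HTML itself, there is no way to find "just the title" without downloading the file. You should be able to download a portion of the file until you have read in the <title> tag or the </head> tag and then stop, but you will still need to download (at least a portion of) the file.

This can be done with HttpWebRequest / HttpWebResponse by reading data from the response stream until we have read in either a <title></title> block or the </head> tag. I added the </head> check because, in valid HTML, the title must appear inside the head block, so with this check we never parse the entire file (unless there is no head block, of course).

The following should accomplish this: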

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

string title = "";
try {
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    // StreamReader decodes incrementally, so a multi-byte character split across
    // two reads is not corrupted (this assumes the page is UTF-8 encoded)
    using (StreamReader reader = new StreamReader(stream, Encoding.UTF8)) {
        // compiled regex to check for a <title></title> block; Singleline lets the
        // title span several lines, and [^>]* tolerates attributes on the tag
        Regex titleCheck = new Regex(@"<title\b[^>]*>\s*(.+?)\s*</title>",
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
        int charsToRead = 8192;
        char[] buffer = new char[charsToRead];
        string contents = "";
        int length = 0;
        while ((length = reader.Read(buffer, 0, charsToRead)) > 0) {
            // add the newly decoded characters to the rest of the
            // contents that have been downloaded so far
            contents += new string(buffer, 0, length);

            Match m = titleCheck.Match(contents);
            if (m.Success) {
                // we found a <title></title> match =]
                title = m.Groups[1].Value;
                break;
            } else if (contents.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) {
                // reached end of head-block (any letter case); no title found =[
                break;
            }
        }
    }
} catch (Exception e) {
    Console.WriteLine(e);
}

UPDATE: Updated the original source example to use a compiled Regex and a using statement for the Stream, for better efficiency and maintainability.
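The stop-early idea above does not depend on the network, so it can be tried against any Stream. The following is a minimal sketch, not the answer's original code: the class name TitleScanner, the method ReadTitle, and the maxChars cap are hypothetical names introduced here, and UTF-8 content is assumed. The cap is one possible way to bound the download for pages that have no head block at all.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class TitleScanner
{
    // Matches a <title> element, allowing attributes and line breaks inside it.
    private static readonly Regex TitleCheck = new Regex(
        @"<title\b[^>]*>\s*(.+?)\s*</title>",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Reads the stream in small chunks and returns the title. Returns an empty
    // string when </head> is seen first, when roughly maxChars characters have
    // been read without a match, or when the stream ends.
    // Assumption: the content is UTF-8; maxChars is a hypothetical safety cap.
    public static string ReadTitle(Stream stream, int maxChars = 65536)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            var contents = new StringBuilder();
            var buffer = new char[1024];
            int length;
            while (contents.Length < maxChars &&
                   (length = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                contents.Append(buffer, 0, length);
                string soFar = contents.ToString();

                Match m = TitleCheck.Match(soFar);
                if (m.Success)
                {
                    return m.Groups[1].Value;
                }
                if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    return ""; // end of head block reached, no title found
                }
            }
            return ""; // cap reached or stream ended without a title
        }
    }
}
```

With a real request, the stream returned by response.GetResponseStream() could be passed to ReadTitle in place of a test stream. Note that the reader disposes the stream when it finishes.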

Answer 2

A simpler way to handle this is to download the page and then split it:

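    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    // Place this method inside a class. It returns Task<string> rather than
    // async void so that the caller can await the result and observe exceptions.
    private static async Task<string> GetSiteAsync(string url)
    {
        using (HttpClient hc = new HttpClient())
        {
            HttpResponseMessage response = await hc.GetAsync(new Uri(url, UriKind.Absolute));
            string source = await response.Content.ReadAsStringAsync();

            // process the source here, or hand it back to the caller
            return source;
        }
    }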

To process the source, you can use the method described in the article "Getting Content From Between HTML Tags".
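The linked article's method is not reproduced here. As one possible way to pull the title out of a string that has already been downloaded, the sketch below searches for the opening and closing tags with IndexOf and takes the text between them. The class name TitleExtractor and the method GetTitle are hypothetical names chosen for illustration.

```csharp
using System;
using System.Net;

public static class TitleExtractor
{
    // Returns the text between <title ...> and </title>, trimmed and with HTML
    // entities decoded, or an empty string when no complete title element exists.
    public static string GetTitle(string source)
    {
        // Find the start of the opening tag, ignoring letter case.
        int open = source.IndexOf("<title", StringComparison.OrdinalIgnoreCase);
        if (open < 0) return "";

        // Skip past the end of the opening tag (it may carry attributes).
        int start = source.IndexOf('>', open);
        if (start < 0) return "";
        start++;

        // Find the closing tag that follows.
        int end = source.IndexOf("</title>", start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return "";

        string raw = source.Substring(start, end - start).Trim();
        return WebUtility.HtmlDecode(raw);
    }
}
```

This works on the whole downloaded source, so it trades bandwidth for simplicity compared with stopping the download early.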

Answer 1
16, accepted, 2012-07-25 15:29:19

Answer 2
2, 2012-10-04 01:43:52
