
How to get webpage title without downloading all the page source

I'm looking for a method that will allow me to get the title of a webpage and store it as a string.

However, all the solutions I have found so far involve downloading the full source code for the page, which isn't really practical for a large number of webpages.

The only way I can see would be to limit how much is downloaded, so that it either fetches a set number of characters or stops once it reaches the title tag, but even that could still be quite large?

Thanks

As the <title> tag is in the HTML itself, there is no way to avoid downloading the file to find "just the title". You should be able to download a portion of the file until you've read in the <title> tag, or the </head> tag, and then stop, but you'll still need to download (at least a portion of) the file.

This can be accomplished with HttpWebRequest / HttpWebResponse and reading in data from the response stream until we've either read in a <title></title> block, or the </head> tag. I added the </head> tag check because, in valid HTML, the title block must appear within the head block - so, with this check we will never parse the entire file in any case (unless there is no head block, of course).

The following should be able to accomplish this task:

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

string title = "";
try {
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream stream = response.GetResponseStream()) {
        // compiled regex to check for a <title></title> block
        Regex titleCheck = new Regex(@"<title>\s*(.+?)\s*</title>", RegexOptions.Compiled | RegexOptions.IgnoreCase);
        int bytesToRead = 8192;
        byte[] buffer = new byte[bytesToRead];
        string contents = "";
        int length = 0;
        while ((length = stream.Read(buffer, 0, bytesToRead)) > 0) {
            // convert the byte array to a string (assuming UTF-8) and append it
            // to the contents that have been downloaded so far
            contents += Encoding.UTF8.GetString(buffer, 0, length);

            Match m = titleCheck.Match(contents);
            if (m.Success) {
                // we found a <title></title> match =]
                title = m.Groups[1].Value;
                break;
            } else if (contents.Contains("</head>")) {
                // reached the end of the head block; no title found =[
                break;
            }
        }
    }
} catch (Exception e) {
    Console.WriteLine(e);
}

UPDATE: Updated the original source-example to use a compiled Regex and a using statement for the Stream for better efficiency and maintainability.
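On newer frameworks, the same partial-read idea can be sketched with HttpClient. This is not part of the original answer; GetPageTitleAsync is a hypothetical helper, and it assumes the page is UTF-8 encoded with its <title> inside the <head> block:

using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static async Task<string> GetPageTitleAsync(string url)
{
    using (HttpClient client = new HttpClient())
    // ResponseHeadersRead returns as soon as the headers arrive, so the body
    // can be read incrementally instead of being buffered in full
    using (HttpResponseMessage response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
    using (Stream stream = await response.Content.ReadAsStreamAsync())
    {
        Regex titleCheck = new Regex(@"<title>\s*(.+?)\s*</title>", RegexOptions.IgnoreCase);
        byte[] buffer = new byte[8192];
        StringBuilder contents = new StringBuilder();
        int length;
        while ((length = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            // assume UTF-8 and append the new chunk to what we've read so far
            contents.Append(Encoding.UTF8.GetString(buffer, 0, length));
            string soFar = contents.ToString();

            Match m = titleCheck.Match(soFar);
            if (m.Success)
                return m.Groups[1].Value;   // found the title; stop downloading
            if (soFar.Contains("</head>"))
                break;                      // head block ended without a title
        }
        return "";
    }
}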

A simpler way to handle this would be to download the whole page and then split out the title:

using System;
using System.Net.Http;
using System.Threading.Tasks;

private async Task getSite(string url)
{
    HttpClient hc = new HttpClient();
    HttpResponseMessage response = await hc.GetAsync(new Uri(url, UriKind.Absolute));
    string source = await response.Content.ReadAsStringAsync();

    // process the source here
}

To process the source, you can use the method described in the article Getting Content From Between HTML Tags.
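For the title specifically, a minimal sketch of that processing step might look like this (it replaces the "process the source here" comment above, assumes the page contains a single <title> element, and needs System.Text.RegularExpressions):

// pull the title out of the full page source downloaded by getSite
Match m = Regex.Match(source, @"<title>\s*(.+?)\s*</title>",
                      RegexOptions.IgnoreCase | RegexOptions.Singleline);
string title = m.Success ? m.Groups[1].Value : "";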
