
How to get the title of a web page with libcurl

How can I get the title of a web page with curl? I want to pass an http or https URL and get the title of that page. I figured out that curl_easy_perform(curl) prints the HTML to the terminal, but I can't figure out how to parse the HTML.

libcurl is not an HTML parsing library; its focus is on the transport, i.e. getting you the bits. You need to either interpret them yourself or turn to other libraries.

In your case you need to look for the <title> element and extract that element's text.
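As a rough illustration of "look for the <title> element", here is a minimal sketch that assumes the page is already in memory as a std::string. The function name extractTitle is made up for this example; it is not part of libcurl. It does plain substring search, not real HTML parsing, so it can be fooled by things like a "<title" inside a comment or script.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not a libcurl function): returns the text between the
// first <title ...> start tag and the following </title>, or an empty string
// if no such pair is found.
std::string extractTitle(const std::string& html)
{
    // Search in a lower-cased copy so that <TITLE> and <Title> are found too.
    // Lower-casing byte by byte keeps offsets identical to the original string.
    std::string lower(html);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::size_t open = lower.find("<title");
    if (open == std::string::npos) return {};

    // Skip any attributes up to the end of the start tag.
    const std::size_t start = lower.find('>', open);
    if (start == std::string::npos) return {};

    const std::size_t close = lower.find("</title", start + 1);
    if (close == std::string::npos) return {};

    // Cut the text out of the original string so its case is preserved.
    return html.substr(start + 1, close - start - 1);
}
```

The result still contains any HTML entities and surrounding whitespace exactly as they appear in the page.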

It's a bit too large to paste here, but the libcurl example that saves content in memory shows how to do this in C. It uses the curl_easy_setopt() function to register a CURLOPT_WRITEFUNCTION callback, which receives all the data. Without such a callback libcurl writes the body to stdout, which is why you currently see the HTML in the terminal.

Note that the libcurl example uses an "exact-fitting" dynamic string, i.e. it calls realloc() every time it gets more data. This is generally not the best approach, but it is simple to implement and understand, which makes sense in an example.
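For comparison, a write callback that grows its buffer geometrically instead of exact-fitting could look roughly like the sketch below. The Buffer struct, the name writeToBuffer and the 4096-byte starting capacity are assumptions made for this example; only the callback signature is taken from libcurl. The block deliberately does not include curl.h, so the callback can be tried on its own.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical buffer type for this sketch; not a libcurl type.
struct Buffer {
    char*       data     = nullptr;
    std::size_t size     = 0;  // bytes stored, not counting the trailing '\0'
    std::size_t capacity = 0;  // bytes allocated
};

// Has the signature libcurl expects for a CURLOPT_WRITEFUNCTION callback.
// userdata is whatever pointer was set with CURLOPT_WRITEDATA (here a Buffer*).
std::size_t writeToBuffer(char* ptr, std::size_t size, std::size_t nmemb, void* userdata)
{
    Buffer* buf = static_cast<Buffer*>(userdata);
    const std::size_t incoming = size * nmemb;
    if (incoming == 0) return 0;  // nothing to store

    const std::size_t needed = buf->size + incoming + 1;  // +1 for '\0'
    if (needed > buf->capacity) {
        // Double the capacity until it fits, so realloc() is called only
        // occasionally rather than once per chunk.
        std::size_t newCap = buf->capacity ? buf->capacity : 4096;
        while (newCap < needed) newCap *= 2;

        char* grown = static_cast<char*>(std::realloc(buf->data, newCap));
        if (!grown) return 0;  // a short count tells libcurl to abort the transfer
        buf->data = grown;
        buf->capacity = newCap;
    }

    std::memcpy(buf->data + buf->size, ptr, incoming);
    buf->size += incoming;
    buf->data[buf->size] = '\0';  // keep the contents usable as a C string
    return incoming;
}
```

In a real transfer this would be registered with curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeToBuffer) and curl_easy_setopt(curl, CURLOPT_WRITEDATA, &buf), and the caller would release buf.data with free() afterwards. In C++ a std::string gives you the same amortized growth for free.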

libcurl doesn't parse HTML for you. You need to use another library for that or write your own parser.

Have a look at HTML Tidy. The libcurl site has an example.

If you only want the title, you can try a simple solution using std::string search or regular expressions.

#include <cstdio>
#include <iostream>
#include <regex>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <curl/curl.h>

// libcurl write callback: append each received chunk to the std::string
// that was passed in via CURLOPT_WRITEDATA
size_t curl_to_string(char* ptr, size_t size, size_t nmemb, void* data)
{
    std::string* str = static_cast<std::string*>(data);
    str->append(ptr, size * nmemb);
    return size * nmemb;
}

std::string curlGetHtmlSource(const std::string& link)
{
    std::string html_txt;
    CURL* curl = curl_easy_init();
    if (!curl)
    {
        throw std::runtime_error("Can't initialise curl");
    }

    curl_easy_setopt(curl, CURLOPT_URL, link.c_str());
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L); // this option takes a long
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, curl_to_string);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html_txt);

    /* Perform the request, res will get the return code */
    CURLcode res = curl_easy_perform(curl);

    /* always cleanup, whether or not the transfer succeeded */
    curl_easy_cleanup(curl);

    /* Check for errors */
    if (res != CURLE_OK)
    {
        fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));
        throw std::runtime_error("Can't get html source");
    }

    return html_txt;
}

// Return the first capture group of the first match of regx in in_string,
// or an empty string if there is no match
std::string stringRegex(const std::string& in_string, const std::string& regx)
{
    const std::regex pattern{ regx };
    std::smatch match;

    if (std::regex_search(in_string, match, pattern) && match.size() > 1)
    {
        return match[1].str();
    }
    return {};
}

// Replace a small set of named HTML entities with the characters they stand for
std::string entityParser(const std::string& text)
{
    static const std::unordered_map<std::string, std::string> convert({
        {"&quot;", "\""},
        {"&apos;", "'"},
        {"&amp;", "&"},
        {"&gt;", ">"},
        {"&lt;", "<"},
        {"&frasl;", "/"} // really U+2044 FRACTION SLASH, approximated as an ASCII slash
        });
    std::string res;
    for (size_t i = 0; i < text.size(); ++i)
    {
        bool replaced = false;
        if (text[i] == '&')
        {
            for (const auto& entry : convert)
            {
                const std::string& key = entry.first;
                // compare() clamps the length at the end of text, so no extra bounds check is needed
                if (text.compare(i, key.size(), key) == 0)
                {
                    res += entry.second;
                    i += key.size() - 1; // skip the rest of the entity
                    replaced = true;
                    break;
                }
            }
        }
        if (!replaced)
        {
            res += text[i];
        }
    }
    return res;
}

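std::string getTitle(std::string& link)
{
    const std::string html = curlGetHtmlSource(link);
    // [^>]* also accepts a <title> start tag that carries attributes;
    // the match is still case-sensitive, so <TITLE> is not found
    const std::string raw_title = stringRegex(html, R"(<title[^>]*>([^<]*)<)");
    return entityParser(raw_title);
}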

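int main()
{
    std::string link = "https://example.com";
    try {
        std::cout << getTitle(link) << '\n';
    } catch (const std::exception& e) {
        std::cerr << e.what() << '\n';
        return 1;
    }
}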

Sources:

curl_to_string - https://stackoverflow.com/a/5525631/17061201

curlGetHtmlSource - https://curl.se/libcurl/c/simple.html

entityParser - https://helloacm.com/a-simple-html-entity-parser-in-c/
