
Parsing HTTPS pages with libxml2

I'm trying to program a web crawler with libxml2 for my HiWi (student assistant) job. To do so, I also have to parse HTTPS pages from the web, but is this even possible?

I can already parse an HTML page:

    #include <libxml/HTMLparser.h>

    const char *new_url = "http://xmlsoft.org/html/libxml-HTMLparser.html#htmlParserCtxtPtr";
    /* A fresh context; no scratch buffer needed, unlike htmlCreateMemoryParserCtxt() */
    htmlParserCtxtPtr ctxt = htmlNewParserCtxt();
    /* 32 == HTML_PARSE_NOERROR: suppress error reports while parsing */
    htmlDocPtr new_page_tree = htmlCtxtReadFile(ctxt, new_url, NULL, HTML_PARSE_NOERROR);

But if the URL starts with "https", for example

    https://stackoverflow.com/

I get the warning

    I/O warning : failed to load external entity

Is this possible, and if it is, how can I get access to an HTTPS page with libxml2?

Thank you for your help :)

From the documentation:

To some extent libxml2 provides support for the following additional specifications but doesn't claim to implement them completely:

  • Document Object Model (DOM) http://www.w3.org/TR/DOM-Level-2-Core/ the document model, but it doesn't implement the API itself, gdome2 does this on top of libxml2
  • RFC 959 : libxml2 implements a basic FTP client code
  • RFC 1945 : HTTP/1.0, again a basic HTTP client code
  • SAX: a SAX2 like interface and a minimal SAX1 implementation compatible with early expat versions

There is no indication that it supports HTTPS communication.

You could obtain the HTML page using a proper HTTP(S) client, then pass it to libxml2 for parsing.
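
For example, here is a minimal sketch that uses libcurl for the HTTPS transfer (libcurl is my assumption; any HTTPS-capable client works) and then hands the downloaded bytes to htmlReadMemory():

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <curl/curl.h>
    #include <libxml/HTMLparser.h>

    /* Grow a heap buffer as libcurl delivers chunks of the response body. */
    struct buf { char *data; size_t len; };

    static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
    {
        struct buf *b = userdata;
        size_t n = size * nmemb;
        char *p = realloc(b->data, b->len + n + 1);
        if (!p)
            return 0;                  /* a short count aborts the transfer */
        b->data = p;
        memcpy(b->data + b->len, ptr, n);
        b->len += n;
        b->data[b->len] = '\0';
        return n;
    }

    int main(void)
    {
        struct buf b = { NULL, 0 };
        const char *url = "https://stackoverflow.com/";

        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &b);
        CURLcode rc = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
        if (rc != CURLE_OK) {
            fprintf(stderr, "fetch failed: %s\n", curl_easy_strerror(rc));
            return 1;
        }

        /* Parse from memory, so libxml2 never does the network I/O itself. */
        htmlDocPtr doc = htmlReadMemory(b.data, (int)b.len, url, NULL,
                                        HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
        if (doc) {
            /* ... walk the tree via doc->children etc. ... */
            xmlFreeDoc(doc);
        }
        free(b.data);
        curl_global_cleanup();
        return 0;
    }

Passing the original URL as the third argument to htmlReadMemory() keeps relative links in the document resolvable later. Compile with something like: gcc crawler.c $(curl-config --cflags --libs) $(xml2-config --cflags --libs)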

(I'm sure it's unintentionally ironic that xmlsoft.org's SSL certificate is broken!)
