
Parsing HTTPS pages with libxml2

I'm trying to program a web crawler with libxml2 for my HiWi (student assistant) job. To do so, I also have to parse HTTPS pages from the web, but is this even possible?

I can already parse an HTML page:

    #include <libxml/HTMLparser.h>

    const char *new_url = "http://xmlsoft.org/html/libxml-HTMLparser.html#htmlParserCtxtPtr";
    /* A fresh context; no scratch buffer needed, unlike htmlCreateMemoryParserCtxt() */
    htmlParserCtxtPtr ctxt = htmlNewParserCtxt();
    /* 32 == HTML_PARSE_NOERROR: suppress error reports while parsing */
    htmlDocPtr new_page_tree = htmlCtxtReadFile(ctxt, new_url, NULL, HTML_PARSE_NOERROR);

But if the URL starts with "https", for example

    https://stackoverflow.com/

I get the warning

    I/O warning : failed to load external entity

Is this possible, and if it is, how can I get access to an HTTPS page with libxml2?

Thank you for your help :)

From the documentation:

To some extent libxml2 provides support for the following additional specifications but doesn't claim to implement them completely:

  • Document Object Model (DOM) http://www.w3.org/TR/DOM-Level-2-Core/ the document model, but it doesn't implement the API itself, gdome2 does this on top of libxml2
  • RFC 959 : libxml2 implements a basic FTP client code
  • RFC 1945 : HTTP/1.0, again a basic HTTP client code
  • SAX: a SAX2 like interface and a minimal SAX1 implementation compatible with early expat versions

There is no indication that it supports HTTPS communication.

You could obtain the HTML page using a proper HTTP(S) client, then pass it to libxml2 for parsing.
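
For example, here is a minimal sketch that uses libcurl for the HTTPS transfer (libcurl is my assumption; any HTTPS-capable client works) and then hands the downloaded bytes to htmlReadMemory():

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <curl/curl.h>
    #include <libxml/HTMLparser.h>

    /* Grow a heap buffer as libcurl delivers chunks of the response body. */
    struct buf { char *data; size_t len; };

    static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
    {
        struct buf *b = userdata;
        size_t n = size * nmemb;
        char *p = realloc(b->data, b->len + n + 1);
        if (!p)
            return 0;                  /* a short count aborts the transfer */
        b->data = p;
        memcpy(b->data + b->len, ptr, n);
        b->len += n;
        b->data[b->len] = '\0';
        return n;
    }

    int main(void)
    {
        struct buf b = { NULL, 0 };
        const char *url = "https://stackoverflow.com/";

        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &b);
        CURLcode rc = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
        if (rc != CURLE_OK) {
            fprintf(stderr, "fetch failed: %s\n", curl_easy_strerror(rc));
            return 1;
        }

        /* Parse from memory, so libxml2 never does the network I/O itself. */
        htmlDocPtr doc = htmlReadMemory(b.data, (int)b.len, url, NULL,
                                        HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
        if (doc) {
            /* ... walk the tree via doc->children etc. ... */
            xmlFreeDoc(doc);
        }
        free(b.data);
        curl_global_cleanup();
        return 0;
    }

Passing the original URL as the third argument to htmlReadMemory() keeps relative links in the document resolvable later. Compile with something like: gcc crawler.c $(curl-config --cflags --libs) $(xml2-config --cflags --libs)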

(I'm sure it's unintentionally ironic that xmlsoft.org's SSL certificate is broken!)
