
libxml2 XPath parsing doesn't work as expected

I decided to use the libxml2 parser for my Qt application and I'm stuck on XPath expressions. I found an example class with its methods and modified it a bit for my needs. The code:

QStringList* LibXml2Reader::XPathParsing(QXmlInputSource input)
{
    xmlInitParser();

    xmlDocPtr doc;
    xmlXPathContextPtr xpathCtx;
    xmlXPathObjectPtr xpathObj;
    QStringList *valList =NULL;

    QByteArray arr = input.data().toUtf8();  //convert input data to utf8
    int length = arr.length();
    const char* data = arr.data();

    doc = xmlRecoverMemory(data,length); // build a tree, ignoring the errors
    if(doc == NULL) { return NULL;}

    xpathCtx = xmlXPathNewContext(doc); 
    if(xpathCtx == NULL)
    {
        xmlFreeDoc(doc);
        xmlCleanupParser();
        return NULL;
    }

    xpathObj = xmlXPathEvalExpression(BAD_CAST "//[@class='b-domik__nojs']", xpathCtx); //heres the parsing fails
    if(xpathObj == NULL)
    {
        xmlXPathFreeContext(xpathCtx);
        xmlFreeDoc(doc);
        xmlCleanupParser();
        return NULL;
    }

    xmlNodeSetPtr nodes = xpathObj->nodesetval;
    int size = (nodes) ? nodes->nodeNr : 0;
    if(size==0)
    {
        xmlXPathFreeObject(xpathObj); // free the result object as well, otherwise it leaks
        xmlXPathFreeContext(xpathCtx);
        xmlFreeDoc(doc);
        xmlCleanupParser();
        return NULL;
    }
    valList = new QStringList();
    for (int i = 0; i < size; i++)
    {
        xmlNodePtr current = nodes->nodeTab[i];
        const char* str = (const char*)current->content;
        qDebug() << "name: " << QString::fromLocal8Bit((const char*)current->name);
        qDebug() << "content: " << QString::fromLocal8Bit((const char*)current->content) << "\r\n";
        valList->append(QString::fromLocal8Bit(str));
    }

    xmlXPathFreeObject(xpathObj);
    xmlXPathFreeContext(xpathCtx);
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return valList;
}

As an example, I'm making a request to http://yandex.ru/ and trying to get the node with class b-domik__nojs, which is basically a single div.

xpathObj = xmlXPathEvalExpression(BAD_CAST "//[@class='b-domik__nojs']", xpathCtx); //heres the parsing fails

The problem is that the expression //[@class='b-domik__nojs'] doesn't work at all. I checked it in an XPath extension for Firefox and in an XPath extension for the Opera developer tools, and there the expression works perfectly.

I also tried to get other nodes by their attributes, but for some reason XPath fails for ANY attribute. Is there something wrong in my method? Also, when I load the tree using xmlRecoverMemory, I get a lot of parser errors in the debug output.


OK, I played with my libxml2 function a bit more and used the "//*" expression to get all elements in the document, but it returns only the elements inside the first child node of the body tag. This is the yandex.ru DOM tree.

So basically it gets ALL the elements in the first div, <div class="b-line b-line_bar">, but for some reason it doesn't look at the elements in the other child nodes of <body>.

Why can that happen? Maybe xmlParseMemory doesn't build the full tree for some reason? Is there any way to fix this?

All right, it works now. My mistake was using the XML functions to build a tree from an HTML document. I used htmlReadMemory instead, and the tree is now fully built. Some code again:

xmlInitParser();


xmlDocPtr doc;
xmlXPathContextPtr xpathCtx;
xmlXPathObjectPtr xpathObj;


QByteArray arr = input.data().toUtf8();
int length = arr.length();
const char* data = arr.data();

doc = htmlReadMemory(data,length,"",NULL,HTML_PARSE_RECOVER);

if(doc == NULL) { return NULL;}


xpathCtx = xmlXPathNewContext(doc); 
if(xpathCtx == NULL)
{
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return NULL;
}
xpathObj = xmlXPathEvalExpression(BAD_CAST "//*[@class='b-domik__nojs']", xpathCtx);

etc.

It is really strange that the expression works anywhere, because it is not a valid XPath expression. After the axis specification (//), there must be a node test (an element name or *) before the predicate (the condition in square brackets):

//*[@class='b-domik__nojs']
