Best and efficient way to parse both XML and HTML in C

Hi folks! I'm looking for the best, most efficient way to parse server responses that contain both HTML and XML. The responses come from servers that I need to poll every 5 minutes (there are about 500 of them on the list at the moment, but that number will double very soon). Each response is stored in a buffer as plain text (read from a socket). So I need to parse the HTML part and, if that succeeds (the mandatory items are found), then try to parse the XML part and extract the statistics to store in a DB. The responses look like this:

HTTP/1.0 200 OK
Connection: close
Content-Length: 682
Content-Type: text/xml; charset=utf-8
Date: Sun, 09 Mar 2014 15:44:52 GMT
Last-Modified: Sun, 09 Mar 2014 15:44:52 GMT
Server: DrWebAV-DeskServer/REL-610-AV-6.02.0.201311040 Linux/x86_64 Lua/5.1.4 OpenSSL/1.0.0e

<?xml version="1.0" encoding="utf-8"?><avdesk-xml-api API='2.1.0' API_BUILD='20130709' branch='REL-610-AV' oper='get-server-info' rc='true' timestamp='20140309154452987' version='6.02.0.201311040'><server><id>00c1d140-d21d-b211-a828-b62919c4250d</id><platform>Linux 2.6.39-gentoo-r3 x86_64 (4 SMP Mon Oct 24 11:04:40 YEKT 2011)</platform><version>6.02.0.201311040</version><statistics from='20140301000000000' till='20140309235959999'><noviruses/><stations total='101'><online>5</online><deinstalled>21</deinstalled><blocked>0</blocked><expired>81</expired><offline>96</offline><activated>74</activated><unactivated>27</unactivated></stations></statistics></server></avdesk-xml-api>

Or they could be something like this:

HTTP/1.0 401 Authorization Required
Cache-Control: post-check=0, pre-check=0
Connection: close
Content-Length: 421
Content-Type: text/html; charset=utf-8
Date: Sun, 09 Mar 2014 15:44:22 GMT
Expires: Date: Sat, 27 Nov 2004 10:18:15 GMT
Last-Modified: Date: Sat, 27 Nov 2004 10:18:15 GMT
Pragma: no-cahe
Server: DrWebAV-DeskServer/REL-610-AV-6.02.0.201311040 Linux/x86_64 Lua/5.1.4 OpenSSL/1.0.1
WWW-Authenticate: Basic realm="Dr.Web XML API area"
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><HTML><TITLE>Unauthorized</TITLE><BODY><STRONG>Unauthorized</STRONG><P>The error "401 Unauthorized" occured while processing request you had sent.<P><BR><BR><I>Access denied or your browser does not support HTTP authentication!</I><BR><P><BR><BR><HR><P>Dr.Web &reg; AV-Desk Server REL-610-AV 6.02.0.201311040 Linux/x86_64 Lua/5.1.4 OpenSSL/1.0.1</BODY></HTML>

In the HTML part I'm basically interested in the HTTP/1.0 STRING and Server: STRING lines; then, if authorization succeeded, I need per-tag XML parsing. I have found that libxml2 is suitable for parsing both HTML and XML, but I can't find any real examples of how to use it, only a high-level interface description. So, help is needed.

Code examples for libxml2 are here.

The mailing list is friendly, and the code is mature and of good quality.

However, nothing in your example suggests that you need to parse HTML. You need (I think) to parse HTTP to process the headers (and detect the 401 error from the HTTP response), and then parse the XML content. Parsing HTTP headers to the level you require is trivial: just strtok the response, splitting on line breaks; the first line has the answer you need. The body of the response starts after a double line break (I think your second example has a paste error). This reduces your task to processing HTTP headers and XML, with no HTML parsing.
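The header handling described above can be sketched in plain C as follows. The `struct http_info` type and the `parse_http_response` function are hypothetical names used only for this illustration. Because `strtok` skips empty tokens, it cannot find the blank line itself, so the sketch locates the start of the body with `strstr` first and only then tokenizes a copy of the header block. Header names are matched case-sensitively here, which is a simplification.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
struct http_info {
    int status;         /* e.g. 200 or 401; 0 if not parsed */
    char server[256];   /* value of the Server: header, "" if absent */
    const char *body;   /* points into the caller's buffer, NULL if no blank line */
};

/* Parse the status line and the Server header of a NUL-terminated response.
 * Returns 0 on success, -1 if the status line is malformed or malloc fails. */
int parse_http_response(const char *resp, struct http_info *out)
{
    const char *crlf, *lf, *end;
    size_t sep, head_len;
    char *head, *line;

    assert(resp != NULL && out != NULL);
    out->status = 0;
    out->server[0] = '\0';
    out->body = NULL;

    /* The body starts after the first empty line (CRLF CRLF or LF LF). */
    crlf = strstr(resp, "\r\n\r\n");
    lf = strstr(resp, "\n\n");
    end = crlf;
    sep = 4;
    if (end == NULL || (lf != NULL && lf < end)) {
        end = lf;
        sep = 2;
    }
    if (end != NULL) {
        out->body = end + sep;
        head_len = (size_t)(end - resp);
    } else {
        head_len = strlen(resp);        /* no body separator found */
    }

    /* strtok modifies its input, so work on a private copy of the headers. */
    head = malloc(head_len + 1);
    if (head == NULL)
        return -1;
    memcpy(head, resp, head_len);
    head[head_len] = '\0';

    /* First line, e.g. "HTTP/1.0 200 OK". */
    line = strtok(head, "\r\n");
    if (line == NULL ||
        sscanf(line, "HTTP/%*d.%*d %d", &out->status) != 1) {
        free(head);
        return -1;
    }

    /* Remaining lines: look for "Server: ...". */
    while ((line = strtok(NULL, "\r\n")) != NULL) {
        if (strncmp(line, "Server:", 7) == 0) {
            const char *v = line + 7;
            while (*v == ' ' || *v == '\t')
                v++;
            snprintf(out->server, sizeof out->server, "%s", v);
            break;
        }
    }

    free(head);
    return 0;
}
```

After a successful call, a `status` of 200 with a non-NULL `body` means the XML can be handed to the XML parser; a 401 can be recorded without looking at the body at all. Note that `strtok` keeps hidden static state, so a multithreaded poller on a POSIX system would use `strtok_r` instead.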
