Algorithm for parsing XML in C

Is there any known algorithm that can detect and separate the tags in an XML text file and store the content in another file along with the matching tag details?

I've tried to hard-code it, but it doesn't work for all tags. Tags such as <t> ... </t> work, but tags like <a href="http://example.com"> ... </a> don't.

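#include <ctype.h>
#include <stdio.h>

/* tags[] is the global array of input lines, declared elsewhere in the program. */

void get_output(void){

    int i, j, k, l;

    printf("\n");

    for(i=0; i<1024; i++){
        /* Count angle brackets: a line like <t>text</t> has four. */
        k=0;
        for(j=0; tags[i][j] != '\0'; j++){
            if(tags[i][j] == '<' || tags[i][j] == '>'){
                k++;
            }
        }
        if(k < 4){
            continue;
        }

        /* Print everything between '<' and the first '>' as the tag name.
           For <a href="..."> this still includes the attributes. */
        for(l = 0; tags[i][l+1] != '>' && tags[i][l+1] != '\0'; l++){
            printf("%c", tolower((unsigned char)tags[i][l+1]));
        }
        if(tags[i][l+1] == '\0'){
            printf("\n");
            continue;
        }

        printf(": ");

        /* Print the content up to the next '<'. */
        for(; tags[i][l+2] != '<' && tags[i][l+2] != '\0'; l++){
            printf("%c", tags[i][l+2]);
        }

        printf("\n");
    }
}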

I'm also trying to avoid using third-party libraries.

Your question asks about XML, but you've tagged this as HTML; note that these are rather different beasts.

In terms of its syntax, there's nothing special about XML, and you'd parse it just like you would any other syntax; there's no special algorithm.

That is, you'd use a lexer such as flex to identify a stream of tokens such as <, </, =, strings, quotes and so on, then a parser generator such as bison to write down the syntactic rules, and code on top of that to turn correctly formed syntax into useful data structures (that is, what does your program actually do when it has discovered an element start-tag such as <a href='urn:foo'>?). This is perfectly doable, but it's a non-trivial project.
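To give a feel for the token-level work involved, here is a minimal hand-written sketch in C (no flex or bison) that picks apart a single start-tag into an element name and its attributes. It is an illustration under simplifying assumptions, not a conforming XML parser: names are ASCII only, entity references in attribute values are not decoded, and `struct start_tag`, `parse_start_tag` and the size limits are hypothetical names chosen for this example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_ATTRS 8   /* arbitrary limits for this sketch */
#define MAX_LEN   64

/* Hypothetical result type: one element start-tag. */
struct start_tag {
    char name[MAX_LEN];
    int  nattrs;
    char attr_name[MAX_ATTRS][MAX_LEN];
    char attr_value[MAX_ATTRS][MAX_LEN];
};

/* ASCII-only approximation of the characters allowed in an XML name. */
static int is_name_char(int c)
{
    return isalnum(c) || c == '_' || c == ':' || c == '-' || c == '.';
}

static const char *skip_space(const char *p)
{
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
        p++;
    return p;
}

/* Copy a run of name characters into out. Returns the position after
   the name, or NULL if the name is empty or too long. */
static const char *read_name(const char *p, char *out)
{
    size_t n = 0;
    while (is_name_char((unsigned char)*p)) {
        if (n + 1 >= MAX_LEN)
            return NULL;
        out[n++] = *p++;
    }
    out[n] = '\0';
    return n > 0 ? p : NULL;
}

/* Parse one start-tag beginning at s, e.g. <a href='urn:foo'>.
   Returns a pointer just past the closing '>' (or "/>"), or NULL on a
   syntax error. */
const char *parse_start_tag(const char *s, struct start_tag *tag)
{
    const char *p = s;
    tag->nattrs = 0;
    if (*p++ != '<')
        return NULL;
    if ((p = read_name(p, tag->name)) == NULL)
        return NULL;
    for (;;) {
        const char *q = skip_space(p);
        if (*q == '>')
            return q + 1;
        if (q[0] == '/' && q[1] == '>')
            return q + 2;
        /* An attribute must be preceded by whitespace. */
        if (q == p || tag->nattrs == MAX_ATTRS)
            return NULL;
        if ((p = read_name(q, tag->attr_name[tag->nattrs])) == NULL)
            return NULL;
        p = skip_space(p);
        if (*p++ != '=')
            return NULL;
        p = skip_space(p);
        char quote = *p++;
        if (quote != '"' && quote != '\'')
            return NULL;   /* XML requires quoted attribute values */
        size_t n = 0;
        while (*p != '\0' && *p != quote) {
            if (*p == '<' || n + 1 >= MAX_LEN)
                return NULL;
            tag->attr_value[tag->nattrs][n++] = *p++;
        }
        if (*p != quote)
            return NULL;   /* unterminated value */
        tag->attr_value[tag->nattrs][n] = '\0';
        p++;
        tag->nattrs++;
    }
}
```

Called on `<a href="http://example.com">link</a>`, this fills in the name `a` and one attribute (`href`, `http://example.com`) and returns a pointer to `link</a>`. End-tags, comments, CDATA sections, processing instructions and character data would each need similar handling, which is where the project stops being trivial.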

As part of that, you'll acquire a very close relationship to the XML spec, and you'd be well advised to assemble a lot of test-cases, the more pathological the better. There's a lot of fine detail, and plenty of subtleties, in that spec.
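As a small taste of what "pathological" can mean, the sketch below pairs a few awkward inputs with a scanner that finds where a piece of markup ends. A naive search for the first '>' gets three of them wrong: '>' is legal inside a quoted attribute value, inside a comment, and inside a CDATA section. The names `markup_end`, `struct test_case` and `run_cases` are made up for this example, and the scanner deliberately ignores other constructs such as a DOCTYPE with an internal subset.

```c
#include <stddef.h>
#include <string.h>

/* Given p pointing at a '<', return a pointer to the last character of
   that piece of markup, or NULL if it is unterminated. */
const char *markup_end(const char *p)
{
    if (strncmp(p, "<!--", 4) == 0) {          /* comment ends at --> */
        const char *e = strstr(p + 4, "-->");
        return e ? e + 2 : NULL;
    }
    if (strncmp(p, "<![CDATA[", 9) == 0) {     /* CDATA ends at ]]> */
        const char *e = strstr(p + 9, "]]>");
        return e ? e + 2 : NULL;
    }
    char quote = '\0';                         /* non-zero while inside a quoted value */
    for (p++; *p != '\0'; p++) {
        if (quote != '\0') {
            if (*p == quote)
                quote = '\0';
        } else if (*p == '"' || *p == '\'') {
            quote = *p;
        } else if (*p == '>') {
            return p;
        }
    }
    return NULL;
}

/* Each case gives an input and the expected offset of the end of the
   first piece of markup (-1 means "unterminated"). */
struct test_case {
    const char *input;
    ptrdiff_t   expected_end;
};

static const struct test_case cases[] = {
    { "<t>text</t>",                2 },
    { "<a title=\"1 > 0\">x</a>",  16 },   /* '>' inside an attribute value */
    { "<!-- a > b -->text",        13 },   /* '>' inside a comment */
    { "<![CDATA[<not a tag>]]>",   22 },   /* markup-like text inside CDATA */
    { "<a href='x",                -1 },   /* unterminated attribute value */
};

/* Returns the number of failing cases. */
int run_cases(void)
{
    int failures = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        const char *end = markup_end(cases[i].input);
        ptrdiff_t got = end ? end - cases[i].input : -1;
        if (got != cases[i].expected_end)
            failures++;
    }
    return failures;
}
```

Keeping the cases in a table like this makes it cheap to add a new one every time the spec turns up another surprise.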

A few months ago, I was working on a project which aimed to extract a subset of the content of XML files. It wasn't a full parse of the file, but I, like you, wanted to keep it simple and avoid third-party libraries. After about a week of solid work, and building on a fair amount of prior experience with yacc/bison-based parsers, I realised that I had actually ended up implementing most of a generic XML parser, which was clearly going to end up reasonably robust and functional, but which was still missing a couple of parsing corner-cases, and was going to be tedious to polish. I decided that using expat wasn't such a bad idea after all, so threw away my code and made significantly more rapid progress building on that work.

Note that parsing well-formed XML is a very different proposition from parsing (often very ill-formed) HTML. Because HTML barely conforms to a grammar at all, a parser for it would have to be significantly more ad hoc; a bison-generated parser might have considerable difficulties unless you put some effort into smart error recovery. You might want to look at a C-based Markdown or Wiki parser for ideas. Or try googling for "tagsoup c" for library suggestions (there's a well-known Java parser for wild HTML called TagSoup, and similar things in other languages tend to give it a shout-out).

If doing this without a third-party library is an intellectual exercise, then it'll be a very instructive one, and an ambitious first parser project. If not, then you'd be very well advised to exploit the considerable effort that's gone into existing libraries.
