Using regexec to get the value of XML tags in C

I'm trying to get the value of XML tags in C using regexec; I cannot use an XML parser.

Below is my sample code. Can someone help me get the expected output?

char value[500];
regex_t regexp_data;    
regmatch_t matched_data[10];
char pattern_str[] = "<CODE[ \t]*^*>[ \t]*\\(.*\\)[ \t]*<\\/CODE[ \t]*>";
char msg_str[] = "<ROOT><INFO><CODE>5001</CODE><MSG>msg one</MSG></INFO> <INFO><CODE>5002</CODE><MSG>msg two</MSG></INFO></ROOT>";

if ((regcomp(&regexp_data, pattern_str, REG_NEWLINE) == 0) &&
  (regexec(&regexp_data, msg_str, 10, matched_data, 0) == 0))
{
   int i;
   for (i=0; i < 10; ++i)
   {
     memset(value, '\0', sizeof(value));
     memcpy(value, &msg_str[matched_data[i].rm_so], (matched_data[i].rm_eo - matched_data[i].rm_so));

     printf ("value [%s]\n", value);
  }

  regfree(&regexp_data);
}

/*----------------------
Output
value [<CODE>5001</CODE><MSG>msg one</MSG></INFO><INFO><CODE>5002</CODE>]
value [5001</CODE><MSG>msg one</MSG></INFO><INFO><CODE>5002]
----------------------
Expected Output
value [5001]
value [5002]
----------------------*/

Your regular expression matches from the first instance of <CODE> to the last instance of </CODE>, because .* is greedy and consumes as much of the input as it can. To prevent this, you can replace \\(.*\\) with \\([^<]*\\), which cannot run past the start of the next tag, so your regex is now:

char pattern_str[] = "<CODE[ \t]*^*>[ \t]*\\([^<]*\\)[ \t]*<\\/CODE[ \t]*>";
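
To see the difference between the greedy .* and the negated class [^<]*, here is a minimal sketch. The helper name first_capture is made up for illustration and is not part of the original code; it compiles a basic (non-extended) pattern and copies capture group 1 of the first match into a caller-supplied buffer.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper (not from the original post): copies capture group 1
 * of the first match of the basic-regex `pattern` in `text` into `out`,
 * writing at most out_size - 1 bytes plus a terminating NUL.
 * Returns 0 on success, -1 if the pattern does not compile or does not match. */
static int first_capture(const char *pattern, const char *text,
                         char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = first group */
    int rc = -1;

    if (out_size == 0 || regcomp(&re, pattern, 0) != 0)
        return -1;

    if (regexec(&re, text, 2, m, 0) == 0 && m[1].rm_so != -1) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len >= out_size)
            len = out_size - 1;   /* truncate rather than overflow */
        memcpy(out, text + m[1].rm_so, len);
        out[len] = '\0';
        rc = 0;
    }

    regfree(&re);
    return rc;
}
```

Called on "<A><CODE>5001</CODE><B>x</B><CODE>5002</CODE></A>", the pattern "<CODE>\\(.*\\)</CODE>" should capture everything between the first <CODE> and the last </CODE>, while "<CODE>\\([^<]*\\)</CODE>" should capture only 5001.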

Per Wiktor's comment, .* is too greedy, so I updated the regex to "<CODE[ \t]*>\\s*([0-9]*)\\s*<\\/CODE[ \t]*>" and passed in the REG_EXTENDED flag to avoid having to escape the parentheses. Note that \s is not part of the POSIX regex syntax; it works as an extension in some implementations such as glibc, and [[:space:]] is the portable spelling.
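
The escaping difference between basic and extended syntax can be checked with a small sketch. The helper regex_matches is a made-up name for illustration; it only reports whether a pattern compiled with the given flags matches anywhere in a string.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): returns 1 if `pattern`,
 * compiled with `cflags`, matches somewhere in `text`, 0 if it does not,
 * and -1 if the pattern fails to compile. */
static int regex_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only a yes/no answer is needed, no match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    rc = (regexec(&re, text, 0, NULL, 0) == 0) ? 1 : 0;
    regfree(&re);
    return rc;
}
```

With REG_EXTENDED, "<CODE>([0-9]*)</CODE>" treats the parentheses as a group and matches "<CODE>5001</CODE>". Compiled with flags 0 (basic syntax), the same pattern treats ( and ) as literal characters, so it matches "<CODE>(5001)</CODE>" instead; the basic-syntax equivalent of the group is "<CODE>\\([0-9]*\\)</CODE>".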

As for capturing multiple matches, follow the approach in the gist Wiktor linked. To get every match, you have to call regexec on the string repeatedly, each time advancing a pointer into the source string by the length of the entire match. The first element in the array of matches is the entire match, while the subsequent elements are the captured groups. Since you only have one captured group, you only need to pass in a size of 2, not 10. Here's the full code I used:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <regex.h>

int main(void) {
    char value[500];
    regex_t regexp_data;
    regmatch_t matched_data[2];   /* [0] = whole match, [1] = captured group */
    /* \s is an extension (e.g. in glibc), not POSIX; [[:space:]] is portable. */
    char pattern_str[] = "<CODE[ \t]*>\\s*([0-9]*)\\s*<\\/CODE[ \t]*>";
    char msg_str[] = "<ROOT><INFO><CODE>5001</CODE><MSG>msg one</MSG></INFO><INFO><CODE>5002</CODE><MSG>msg two</MSG></INFO></ROOT>";
    char *cursor = msg_str;

    if (regcomp(&regexp_data, pattern_str, REG_EXTENDED | REG_NEWLINE) != 0) {
        fprintf(stderr, "Couldn't compile.\n");
        return 1;
    }

    /* regexec returns 0 on a match; REG_NOMATCH or any error ends the loop. */
    while (regexec(&regexp_data, cursor, 2, matched_data, 0) == 0) {
        size_t len = (size_t)(matched_data[1].rm_eo - matched_data[1].rm_so);
        if (len >= sizeof(value))
            len = sizeof(value) - 1;   /* never write past the end of value */
        memcpy(value, cursor + matched_data[1].rm_so, len);
        value[len] = '\0';
        printf("value [%s]\n", value);
        cursor += matched_data[0].rm_eo;   /* resume after the whole match */
    }

    regfree(&regexp_data);
    return 0;
}
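
If the values are needed as numbers rather than printed, the same loop can be wrapped in a function. The following is only a sketch under a few assumptions: the name extract_codes and its signature are invented for illustration, the pattern uses the portable [[:space:]] class in place of \s, and it requires at least one digit ([0-9]+) so that empty tags are skipped.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper (not from the original post): stores up to `max`
 * numeric <CODE> values found in `xml` into `codes`.
 * Returns the number of values stored, or -1 if the pattern fails to compile. */
static int extract_codes(const char *xml, long *codes, size_t max)
{
    /* Assumption: [[:space:]] replaces \s, and [0-9]+ requires a digit. */
    static const char pattern[] =
        "<CODE[[:space:]]*>[[:space:]]*([0-9]+)[[:space:]]*</CODE[[:space:]]*>";
    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = the digits */
    const char *cursor = xml;
    size_t count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (count < max && regexec(&re, cursor, 2, m, 0) == 0) {
        /* strtol stops at the first non-digit, so no copy is needed. */
        codes[count++] = strtol(cursor + m[1].rm_so, NULL, 10);
        cursor += m[0].rm_eo;   /* resume right after this match */
    }

    regfree(&re);
    return (int)count;
}
```

A caller would declare something like long codes[4]; and pass the XML string, the array, and its capacity. Because every match contains at least the literal tag text, the cursor always advances and the loop terminates.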
