
Regex-based strstr function in C

I need a way to get a pointer to a substring (like strstr, first occurrence) for more than one possible needle (pattern) in a large string. C's standard strstr() supports only one needle; I need two or even three. Why all this? I need to be able to "tokenize" an HTML document into parts so that I can further parse these "snippets". The "anchor" I need for tokenizing can vary, for example <div class="blub"> or <span id="bla, and the HTML tags used as tokens could contain numbers in the id/class attribute values (for which I could use \d+ or similar to filter).

So I thought I would write a function using POSIX regex.

The function looks like this:

char * reg_strstr(const char *str, const char *pattern) {
    char *result = NULL;
    regex_t re;
    regmatch_t match[REG_MATCH_SIZE];

    if (str == NULL)
        return NULL;

    if (regcomp( &re, pattern, REG_ICASE | REG_EXTENDED) != 0) {
        regfree( &re );         
        return NULL;
    }

    if (!regexec(&re, str, (size_t) REG_MATCH_SIZE, match, 0)) {

        fprintf( stdout, "Match from %2d to %2d: \"%s\"\n",
             match[0].rm_so,
             match[0].rm_eo,
             str + match[0].rm_so);
        fflush(stdout);

        if ((str + match[0].rm_so) != NULL) {
            result = strndup(str + match[0].rm_so, strlen(str + match[0].rm_so));
        }
    }

    regfree( &re );

    return result;
}

The constant REG_MATCH_SIZE is 10.

First of all, does the idea of using regex as an extended strstr function make sense at all?

In simple test cases the function seems to work fine:

char *str_result = reg_strstr("<tr class=\"i10\"><td><div class=\"xyz\"><!--DDDD-1234--><div class=\"xx21\">", "<div class=\"xyz\">|<div class=\"i10 rr");

printf( "\n\n"
    "reg_strstr result: '%s' ..\n", str_result);

free( str_result) ;

Using the function in a real-world case with a complete HTML document does not work as expected. It does not find the pattern. I am using this function on a memory-mapped string (I use an mmap'ed file as a cache for temporary storage while parsing HTML document data).

EDIT:

Here is the loop as I use it:

Variables: parse_tag->firsttoken and parse_tag->nexttoken are the HTML anchors I try to match, as illustrated above. doc is the input document: a string allocated from the mmap'ed cache and '\0'-terminated (with strndup()). The code below works with strstr() as expected. If I find out that the regex strstr idea really works for me, I can rewrite the loop and maybe return all matches from reg_strstr (as a string list or similar). So for now I am just trying...



...
char *tokfrom = NULL, *tokto = NULL;
char *listend = NULL;

/* first token found ? */
if ((tokfrom = strstr(doc, parse_tag->firsttoken)) != NULL) {
    /* is skipto_nexttoken set ? */
    if (!parse_tag->skipto_nexttoken)
        tokfrom += strlen(parse_tag->firsttoken);
    else {
        /* ignore string between firsttoken and first nexttoken */
        if ((tokfrom = strstr(tokfrom, parse_tag->nexttoken)) == NULL)
            goto end_parse;
    }

/* no listend tag found ? */
if (parse_tag->listend == NULL ||
    (listend = reg_strstr(tokfrom, parse_tag->listend)) == NULL) {
    listend = doc + strlen(doc);
}

*listend = '\0';        /* truncate */

do {
    if((tokto = reg_strstr(tokfrom + 1, parse_tag->nexttoken)) == NULL)
        tokto = listend;
    tokto--;  /* tokto-- : this token up to nexttoken */

    if (tokto <= tokfrom)
        break;

    /* do some filtering with current token here ... */
    /* ... */

} while ((tokfrom = tokto + 1) < listend);

} ...

EDIT END

Am I missing something here? As I said, is what I am trying to accomplish possible at all? Is the regex pattern erroneous?

Suggestions are welcome!

Andreas

I tried your code on a test HTML file that I simply read from a text file through stdin via redirection, and it seemed to work just fine with repeated reads using fgets(). I would therefore suspect that the issue is somewhere in the formatting of the string data in your memory-mapped file. My suspicion is that there is a null-terminating character somewhere in your memory-mapped file, so that if you are simply using the mapped file itself as a char buffer, the string ends far earlier than you expect.

Secondly, you are only returning the first match plus the rest of the string, which means the entire file from the first match onwards if you're passing the pointer to the memory-mapped file as your str parameter. I suspect your "real" implementation is a bit different if you're looking to tokenize the file?


EDIT

I've been looking at your concept code, and it does seem to be working overall. I made a couple of modifications just to help me print things out, but here is what I'm compiling (the file memory-mapping is very quick-and-dirty, just to check whether the regex code is working):

#include <regex.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>

#define REG_MATCH_SIZE 10
#define FILE_SIZE 60000

static int total_matches = 0;

char * reg_strstr(const char *str, const char *pattern) 
{
    char *result = NULL;
    regex_t re;
    regmatch_t match[REG_MATCH_SIZE];

    if (str == NULL)
        return NULL;

    /* If regcomp() fails, re is not valid, so regfree() must not be called. */
    if (regcomp( &re, pattern, REG_ICASE | REG_EXTENDED) != 0)
        return NULL;

    if (!regexec(&re, str, (size_t) REG_MATCH_SIZE, match, 0)) {

        /* regoff_t is not necessarily int, so cast for %d */
        fprintf( stderr, "@@@@@@ Match from %2d to %2d @@@@@@@@@\n",
             (int) match[0].rm_so,
             (int) match[0].rm_eo);

        total_matches++;

        /* Return a copy of everything from the match to the end of str;
           the caller must free() it. */
        result = strdup(str + match[0].rm_so);
    }

    regfree( &re );

    return result;
}


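int main()
{   
    int filedes = open("testhtml.txt", O_RDONLY);
    if (filedes < 0) {
        perror("open");
        return 1;
    }

    /* Quick and dirty: maps a fixed FILE_SIZE; strdup() below stops at the
       first '\0' it finds in the mapping. */
    void* buffer = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, filedes, 0); 
    if (buffer == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    char* str_result;
    char* temp_buffer = strdup((char*)buffer);
    while((str_result = reg_strstr(temp_buffer, "<div")) != NULL)
    {
        char* temp_print = strndup(str_result, 30);
        fprintf(stderr, "reg_strstr result: '%s' ..\n\n", temp_print);
        free(temp_print);
        free(temp_buffer);
        /* continue searching one character past the previous match */
        temp_buffer = strdup(str_result + 1);
        free( str_result) ;
    }

    fprintf(stderr, "Total Matches: %d\n", total_matches);

    free(temp_buffer);
    munmap(buffer, FILE_SIZE);
    close(filedes);

    return 0;
}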

Using just the simple match for "<div", if I run it on the entire HTML source of a page like this one at Bloomberg, I get a total of 87 matches, which is equivalent to what you would get with repeated calls to the standard strstr(). For instance, sample output looks like this (note: I've cut the returned string off after 30 characters for sanity's sake):

@@@@@@ Match from 5321 to 5325 @@@@@@@@@
reg_strstr result: '<div id="noir_dialog" class="p' ..

@@@@@@ Match from 362 to 366 @@@@@@@@@
reg_strstr result: '<div id="container" class="mod' ..

The match indexes change, of course, because each new input string is shorter than the previous one. That's why you see a match that starts at 5321 and then the next match at 362: the overall offset of the second match is 5684 in the original string (5321 + 1 + 362, since each new search starts one character past the previous match). With a different regular expression I'm sure you would get different results, but overall it seems that your concept is working, or at least is working the way strstr() would: it returns the entire string from the start of the match all the way to the end of the string.
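To make the offset arithmetic explicit, here is a minimal sketch that compiles the pattern once and reports every match as an absolute offset into the original string, without copying anything. The helper name reg_match_offsets and its parameters are made up for illustration and are not part of the code above. Unlike the loop above, it resumes after the end of each match rather than one character past its start, which gives the same count for a pattern like "<div".

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): stores the absolute
 * offsets of up to max_offsets matches of pattern in str and returns the
 * total number of matches found, or -1 if the pattern does not compile. */
long reg_match_offsets(const char *str, const char *pattern,
                       size_t *offsets, size_t max_offsets)
{
    regex_t re;
    regmatch_t m;
    size_t base = 0;   /* offset of the current search start within str */
    long count = 0;

    if (regcomp(&re, pattern, REG_ICASE | REG_EXTENDED) != 0)
        return -1;     /* compilation failed, so there is nothing to free */

    /* After the first search we are no longer at the start of the string,
     * so REG_NOTBOL keeps '^' from matching at the new start position. */
    while (regexec(&re, str + base, 1, &m, base ? REG_NOTBOL : 0) == 0) {
        size_t step = (size_t)m.rm_eo;

        if ((size_t)count < max_offsets)
            offsets[count] = base + (size_t)m.rm_so;
        count++;

        if (m.rm_eo == m.rm_so) {          /* empty match */
            if (str[base + step] == '\0')
                break;                     /* already at the terminator */
            step++;                        /* move on by one character */
        }
        base += step;                      /* resume after this match */
    }

    regfree(&re);
    return count;
}
```

Called as reg_match_offsets(doc, "<div", offsets, n), it fills offsets with positions that can be added directly to doc, so no per-match strdup() is needed.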

If you're not getting the results you're expecting (I'm not sure exactly what you're not getting), then I would say the problem is either in the regular expression itself or in the loop, where your indexing may be off (i.e., with tokto-- you could be creating a loop that simply keeps returning a match at the same point in the string).
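One thing in the loop that may be worth checking, assuming it is used as posted: reg_strstr() returns a strndup() copy, not a pointer into the string it searched. That means `*listend = '\0'` truncates the copy rather than doc, and comparisons such as `tokto <= tokfrom` or `tokfrom < listend` compare pointers into unrelated buffers. A sketch of a variant that behaves like strstr() and returns a pointer into the original string could look like this (the name reg_strstr_ptr and the match_len parameter are hypothetical):

```c
#include <regex.h>
#include <stddef.h>

/* Sketch of a strstr-like variant (hypothetical name): returns a pointer
 * INTO str at the first match, or NULL if there is none. If match_len is
 * non-NULL it receives the length of the match. Nothing is allocated, so
 * the caller must not free() the result. */
char *reg_strstr_ptr(const char *str, const char *pattern, size_t *match_len)
{
    regex_t re;
    regmatch_t m;
    char *result = NULL;

    if (str == NULL || pattern == NULL)
        return NULL;

    if (regcomp(&re, pattern, REG_ICASE | REG_EXTENDED) != 0)
        return NULL;   /* compilation failed, so there is nothing to free */

    if (regexec(&re, str, 1, &m, 0) == 0) {
        result = (char *)str + m.rm_so;   /* casts away const, as strstr does */
        if (match_len != NULL)
            *match_len = (size_t)(m.rm_eo - m.rm_so);
    }

    regfree(&re);
    return result;
}
```

With a function like this, pointer arithmetic in the loop stays within doc, and the match length tells you how far to advance instead of strlen() of a fixed token. Note also that POSIX extended regular expressions do not define \d, so a class such as [0-9]+ is the portable way to match digits.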

Make sure the data you are loading is null-terminated. Argument 2 of regexec must be a null-terminated string.
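As a rough sketch of how that could be checked, assuming you know the real length of the data (for example st_size from fstat() on the mapped file), the two helpers below look for an embedded '\0' and make a copy that is guaranteed to be terminated. Both function names are invented for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helpers. data and len would come from the mapping, e.g. the
 * pointer returned by mmap() and the file size reported by fstat(). */

/* Returns the offset of the first '\0' inside data[0..len), or len if the
 * data contains none. A result smaller than len means any str* function or
 * regexec() will stop early at that offset. */
size_t first_nul_offset(const void *data, size_t len)
{
    const void *p = memchr(data, '\0', len);
    return p ? (size_t)((const char *)p - (const char *)data) : len;
}

/* Returns a malloc'ed copy of exactly len bytes followed by a '\0', or NULL
 * if allocation fails. The caller must free() the result. */
char *dup_terminated(const void *data, size_t len)
{
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, data, len);
    copy[len] = '\0';
    return copy;
}
```

If first_nul_offset() returns less than the file size, the document contains a '\0' in the middle and the search will never see the rest of it, regardless of the pattern.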

Why don't you simply use a finite state machine that processes one character at a time and parses HTML tags? This way, you'll also get rid of the problem of HTML comments. Think of the following cases:

Anchor tag in HTML comment:

<!-- <my anchortag="foo"> -->

Comment in HTML attribute:

<some tag="<!--"> <my anchortag="foo"> </some tag="-->">

With regular expressions, you'll have a hard time dealing with these cases.
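A very small state machine along these lines might look like the sketch below. It is deliberately simplified and is not a full HTML parser: it only tracks four states (text, inside a tag, inside a quoted attribute value, inside a comment) and reports the first occurrence of a tag prefix that starts in the text state. The function name find_tag_fsm is made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: returns a pointer to the first occurrence of tag
 * (for example "<my anchortag") that begins outside HTML comments and
 * outside quoted attribute values, or NULL if there is none. */
const char *find_tag_fsm(const char *doc, const char *tag)
{
    enum { TEXT, IN_TAG, IN_QUOTE, IN_COMMENT } state = TEXT;
    size_t tag_len = strlen(tag);
    char quote = 0;    /* the quote character that opened the current value */

    for (const char *p = doc; *p != '\0'; p++) {
        switch (state) {
        case TEXT:
            if (strncmp(p, "<!--", 4) == 0) {
                state = IN_COMMENT;
                p += 3;                    /* skip the rest of "<!--" */
            } else if (*p == '<') {
                if (strncmp(p, tag, tag_len) == 0)
                    return p;              /* anchor found in plain text */
                state = IN_TAG;            /* some other tag: step through it */
            }
            break;
        case IN_TAG:
            if (*p == '"' || *p == '\'') {
                quote = *p;
                state = IN_QUOTE;
            } else if (*p == '>') {
                state = TEXT;
            }
            break;
        case IN_QUOTE:
            if (*p == quote)
                state = IN_TAG;            /* closing quote of the value */
            break;
        case IN_COMMENT:
            if (strncmp(p, "-->", 3) == 0) {
                state = TEXT;
                p += 2;                    /* skip the rest of "-->" */
            }
            break;
        }
    }
    return NULL;
}
```

For the two cases above, the anchor inside the comment is skipped, and the "<!--" inside the attribute value does not start a comment, so the anchor that follows it is still found.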

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski)
