How to do regex string replacements in pure C?

I've looked at the regex functions in the POSIX regex library and the PCRE library, but neither seems to have a string replacement function. I don't want to use C++, and I would prefer not to link another library (though I can if I have to). Do I need to do the string replacement manually? If so, how can I use capture groups?

regex.h does not provide native support for string replacement; however, it does provide subexpressions (capture groups), which make it much easier. I'll assume that you're familiar with regex compilation and skip to regex execution and subexpressions.
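For readers who want the compilation step spelled out, here is a minimal sketch. The helper name `compile_pattern` and its parameters are made up for illustration; only `regcomp()`, `regerror()` and the `re_nsub` field come from regex.h.

```c
#include <regex.h>

/* Hypothetical helper: compiles `pattern` as a POSIX extended regex.
 * On success returns 0 and stores the number of capture groups in *ngroups.
 * On failure writes the regerror() message into errbuf and returns the
 * nonzero regcomp() error code. */
int compile_pattern(regex_t *re, const char *pattern, size_t *ngroups,
                    char *errbuf, size_t errlen)
{
    int rc = regcomp(re, pattern, REG_EXTENDED);
    if (rc != 0) {
        regerror(rc, re, errbuf, errlen);  /* human-readable reason */
        return rc;
    }
    *ngroups = re->re_nsub;                /* number of ( ) groups in pattern */
    return 0;
}
```

After a successful call the caller owns the compiled regex and should release it with `regfree()`. The group count is what tells you how large the `regmatch_t` array passed to `regexec()` needs to be (`re_nsub + 1`, since element 0 is the whole match).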

regexec() is declared as follows in regex.h (/usr/include/):

extern int regexec (const regex_t *__restrict __preg,
        const char *__restrict __string, size_t __nmatch,
        regmatch_t __pmatch[__restrict_arr],
        int __eflags);

The first, second, and final arguments are the compiled regex, the string to search, and the execution flags, respectively. The third and fourth arguments specify an array of regmatch_t's: its length and the array itself. Element 0 describes the whole match and element i describes the i-th subexpression. A regmatch_t consists of two fields, rm_so and rm_eo, which are the offsets of the beginning of the matched area and of the first character after it, respectively. These offsets can then be used along with memcpy(), memset() and memmove() from string.h to perform string replacement.
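As a rough illustration of how those offsets turn into a replacement, here is a minimal sketch that replaces only the first match with a literal string. The function name `replace_first` is hypothetical and not part of regex.h. It builds a new buffer rather than editing in place, which avoids the memmove() bookkeeping.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly malloc'd copy of `subject` in which
 * the first match of the POSIX extended regex `pattern` is replaced by the
 * literal string `repl`. Returns a plain copy when nothing matches, and NULL
 * if the pattern does not compile or malloc fails. */
char *replace_first(const char *subject, const char *pattern, const char *repl)
{
    regex_t re;
    regmatch_t m[1];                 /* m[0] describes the whole match */
    char *out = NULL;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return NULL;

    if (regexec(&re, subject, 1, m, 0) == 0) {
        size_t head = (size_t)m[0].rm_so;            /* bytes before the match */
        size_t tail = strlen(subject + m[0].rm_eo);  /* bytes after the match  */
        size_t rlen = strlen(repl);

        out = malloc(head + rlen + tail + 1);
        if (out) {
            memcpy(out, subject, head);
            memcpy(out + head, repl, rlen);
            /* tail + 1 also copies the terminating NUL */
            memcpy(out + head + rlen, subject + m[0].rm_eo, tail + 1);
        }
    } else {
        out = malloc(strlen(subject) + 1);
        if (out)
            strcpy(out, subject);
    }
    regfree(&re);
    return out;
}
```

For example, `replace_first("hello world", "wor[a-z]+", "C")` yields a buffer containing `hello C`, which the caller must free(). Capture groups work the same way: ask regexec() for more than one regmatch_t and copy from `subject + m[i].rm_so` for group i.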

I'll make a little example and post it later.我会做一个小例子,稍后发布。

Good luck, and I hope that this helped.

The PCRE library itself does not provide a replace function, but there is a wrapper available from the PCRE downloads page that accepts Perl-style =~ s/pattern/replace/ syntax and then uses the PCRE native functions to do the substitution for you. Go to http://www.pcre.org/ then click on the Download link: ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/ , then the Contrib directory. The package/project you want is: pcrs-0.0.3-src.tar.gz .

Note that I have not used this myself, so I cannot testify as to how well it works. It is a fairly small and simple piece of code, however, so it may well serve your purpose nicely.

/* regex_replace.c
   :w | !gcc % -o .%<
   :w | !gcc % -o .%< && ./.%<
 */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <regex.h>

void  // *str MUST be free-able, i.e. obtained by strdup, malloc, ...
regex_replace(char **str, const char *pattern, const char *replace) {
    regex_t reg;
    // if regex can't compile pattern, do nothing
    if(!regcomp(&reg, pattern, REG_EXTENDED)) {
        size_t nmatch = reg.re_nsub;
        regmatch_t m[nmatch + 1];
        const char *rpl, *p;
        // count back references in replace
        int br = 0;
        p = replace;
        while(1) {
            while(*++p > 31);
            if(*p) br++;
            else break;
        } // if br is not equal to nmatch, leave
        if(br != nmatch) return;
        // look for matches and replace
        char *new;
        while(!regexec(&reg, *str, nmatch + 1, m, REG_NOTBOL)) {
            // make enough room (rpl is not set yet, so measure replace)
            new = (char *)malloc(strlen(*str) + strlen(replace) + 1);
            if(!new) exit(EXIT_FAILURE);
            *new = 0;
            p = rpl = replace;
            int c;
            strncat(new, *str, m[0].rm_so); // text before pattern
            for(int k=0; k<nmatch; k++) {
                while(*++p > 16); // skip printable char
                c = *p;  // back reference (e.g. \1, \2, ...)
                strncat(new, rpl, p - rpl); // add head of rpl
                // concat match
                strncat(new, *str + m[c].rm_so, m[c].rm_eo - m[c].rm_so);
                rpl = p + 1; // next head starts after the back reference
            }
            strcat(new, rpl); // trailing part of the replacement
            strcat(new, *str + m[0].rm_eo); // trailing text in *str
            free(*str);
            *str = strdup(new);
            free(new);
        }
        // adjust size
        *str = (char *)realloc(*str, strlen(*str) + 1);
    } else
        printf("Could not compile regex: %s\n", pattern);
}

int main(int argc, char *argv[]) 
{
    char *pattern = "\\[([^-]+)->([^]]+)\\]";
    char *str = strdup("before [link->address] some text [link2->addr2] trail");
    char rpl[] = "<a href=\"\2\">\1</a>";
    puts(str);
    regex_replace(&str, pattern, rpl);
    puts(str);
    free(str);
}

I've taken the post by @marnout and fixed it up, addressing a number of bugs and typos. Fixes: memory leaks; infinite replacement if the replacement contains the pattern; printing inside the function replaced with return values; back reference values actually up to 31; documentation; more test examples.

/* regex_replace.c
:w | !gcc % -o .%<
:w | !gcc % -o .%< && ./.%<
:w | !gcc % -o .%< && valgrind -v ./.%<
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <regex.h>

int regex_replace(char **str, const char *pattern, const char *replace) {
    // replaces regex in pattern with replacement observing capture groups
    // *str MUST be free-able, i.e. obtained by strdup, malloc, ...
    // back references are indicated by char codes 1-31 (e.g. "\1", "\2"), so
    // none of those chars (such as a tab) can be used literally in the replacement
    // will not search for matches within replaced text; the search resumes
    // after the end of the previous match
    // returns:
    //   -1 if pattern cannot be compiled
    //   -2 if count of back references and capture groups don't match,
    //      or a back reference names a group that does not exist
    //   otherwise returns number of matches that were found and replaced
    regex_t reg;
    int replacements = 0;
    if(regcomp(&reg, pattern, REG_EXTENDED)) return -1;
    size_t nmatch = reg.re_nsub;
    regmatch_t m[nmatch + 1];
    // count back references in replace and check that each is in range
    size_t br = 0;
    int bad = 0;
    for(const char *p = replace; *p; p++) {
        unsigned char c = (unsigned char)*p;
        if(c > 31) continue;     // ordinary character
        if(c > nmatch) bad = 1;  // no such capture group
        br++;
    }
    if(bad || br != nmatch) {
        regfree(&reg);
        return -2;
    }
    // look for matches and replace
    size_t start = 0;  // offset in *str where the next search begins
    int eflags = 0;    // only the first search may match ^ at the start
    while(!regexec(&reg, *str + start, nmatch + 1, m, eflags)) {
        const char *s = *str + start;  // match offsets are relative to s
        size_t head = start + m[0].rm_so;
        size_t tail = strlen(s + m[0].rm_eo);
        int empty = (m[0].rm_eo == m[0].rm_so);
        // length of the replacement once back references are expanded
        size_t rlen = 0;
        for(const char *p = replace; *p; p++) {
            unsigned char c = (unsigned char)*p;
            rlen += c > 31 ? 1 : (size_t)(m[c].rm_eo - m[c].rm_so);
        }
        char *new = (char *)malloc(head + rlen + tail + 1);
        if(!new) exit(EXIT_FAILURE);
        char *out = new;
        memcpy(out, *str, head);  // text before the match
        out += head;
        for(const char *p = replace; *p; p++) {
            unsigned char c = (unsigned char)*p;
            if(c > 31) { *out++ = *p; continue; }
            size_t glen = (size_t)(m[c].rm_eo - m[c].rm_so);  // 0 if group unmatched
            if(glen) memcpy(out, s + m[c].rm_so, glen);
            out += glen;
        }
        size_t next = (size_t)(out - new);       // resume after the replaced text
        memcpy(out, s + m[0].rm_eo, tail + 1);   // trailing text and its NUL
        free(*str);
        *str = new;
        replacements++;
        eflags = REG_NOTBOL;
        if(empty) {
            if(tail == 0) break;  // empty match at end of string: done
            next++;               // step over one char so the loop makes progress
        }
        start = next;
    }
    regfree(&reg);
    return replacements;
}

const char test1[] = "before [link->address] some text [link2->addr2] trail[a->[b->c]]";
const char *pattern1 = "\\[([^-]+)->([^]]+)\\]";
const char replace1[] = "<a href=\"\2\">\1</a>";

const char test2[] = "abcabcdefghijklmnopqurstuvwxyzabc";
const char *pattern2 = "abc";
const char replace2[] = "!abc";

const char test3[] = "a1a1a1a2ba1";
const char *pattern3 = "a";
const char replace3[] = "aa";
int main(int argc, char *argv[])
{
    char *str1 = (char *)malloc(strlen(test1)+1);
    strcpy(str1,test1);
    puts(str1);
    printf("test 1 Before: [%s], ",str1);
    int repl_count1 = regex_replace(&str1, pattern1, replace1);
    printf("After replacing %d matches: [%s]\n",repl_count1,str1);
    free(str1);

    char *str2 = (char *)malloc(strlen(test2)+1);
    strcpy(str2,test2);
    puts(str2);
    printf("test 2 Before: [%s], ",str2);
    int repl_count2 = regex_replace(&str2, pattern2, replace2);
    printf("After replacing %d matches: [%s]\n",repl_count2,str2);
    free(str2);

    char *str3 = (char *)malloc(strlen(test3)+1);
    strcpy(str3,test3);
    puts(str3);
    printf("test 3 Before: [%s], ",str3);
    int repl_count3 = regex_replace(&str3, pattern3, replace3);
    printf("After replacing %d matches: [%s]\n",repl_count3,str3);
    free(str3);
}
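Because the code above reserves char codes 1-31 for back references, a literal tab cannot appear in the replacement. One possible alternative, sketched here under my own assumptions, is to spell back references textually as a backslash followed by a digit. The helper name `expand_backrefs` and its conventions (`\0` for the whole match, `\\` for a literal backslash, unknown groups expand to nothing) are hypothetical choices for illustration, not something regex.h defines.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: expands `repl` for one match found in `subject`.
 * "\0" to "\9" insert the text of that group (m[0] is the whole match),
 * "\\" inserts one backslash, everything else is copied literally.
 * Groups that do not exist or did not participate expand to nothing.
 * Returns a malloc'd string, or NULL if malloc fails. */
char *expand_backrefs(const char *repl, const char *subject,
                      const regmatch_t *m, size_t ngroups)
{
    char *out = NULL;
    /* Pass 0 only measures the result; pass 1 writes it. */
    for (int pass = 0; pass < 2; pass++) {
        size_t len = 0;
        for (const char *p = repl; *p; p++) {
            if (p[0] == '\\' && p[1] >= '0' && p[1] <= '9') {
                size_t g = (size_t)(p[1] - '0');
                p++;                                   /* consume the digit */
                if (g <= ngroups && m[g].rm_so >= 0) {
                    size_t glen = (size_t)(m[g].rm_eo - m[g].rm_so);
                    if (pass) memcpy(out + len, subject + m[g].rm_so, glen);
                    len += glen;
                }
            } else {
                if (p[0] == '\\' && p[1] == '\\') p++; /* "\\" becomes "\" */
                if (pass) out[len] = *p;
                len++;
            }
        }
        if (pass == 0) {
            out = malloc(len + 1);
            if (!out) return NULL;
        } else {
            out[len] = '\0';
        }
    }
    return out;
}
```

Here `m` is the array filled in by regexec() for `subject` and `ngroups` is the `re_nsub` of the compiled regex. With the pattern from test 1, calling it with a C literal such as `"<a href=\"\\2\">\\1</a>"` produces the anchor tag for one match, and a replace loop like the one above could splice that string in place of each match.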
