"Pattern matching" and extracting in C

I need to parse a lot of filenames (up to 250,000, I guess), including the path, and extract some parts out of each one.

Here is an example:

Original: /my/complete/path/to/80/01/a9/1d.pdf

Needed: 8001a91d

The "pattern" I am looking for will always begin with "/8". The parts I need to extract form a string of 8 hex digits.

My idea is the following (simplified for demonstration):

/* original argument */
char *path = "/my/complete/path/to/80/01/a9/1d.pdf";

/* pointer to substring */
char *begin = NULL;

/* final char array to be built */
char *hex = (char*)malloc(9);

/* find "pattern" */
begin = strstr(path, "/8");
if(begin == NULL)
    return 1;

/* jump to first needed character */
begin++;

/* copy the needed characters to target char array */
strncpy(hex,   begin,   2);
strncpy(hex+2, begin+3, 2);
strncpy(hex+4, begin+6, 2);
strncpy(hex+6, begin+9, 2);
strncpy(hex+8, "\0",    1);     

/* print final char array */
printf("%s\n", hex);

This works. I just have the feeling it is not the cleverest way, and that there might be some traps I don't see myself.

So, does someone have suggestions as to what could be dangerous about this pointer-shifting approach? What would be an improvement in your opinion?

Does C provide functionality to do it like so: s|/(8.)/(..)/(..)/(..)\\.|\\1\\2\\3\\4| ? If I remember right, some scripting languages have a feature like that, if you know what I mean.

C itself doesn't provide this, but you can use POSIX regex. It's a full-featured regular expression library. But for a pattern as simple as yours, the approach you already have is probably the best way.
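For comparison, here is a minimal sketch of what the POSIX regex route could look like. The function name extract_hex_regex and its return convention are made up for illustration, and the pattern is tightened from "." to hex digits, which is an assumption about the input rather than something stated in the question.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: writes the 8 hex digits plus '\0' into out (9 bytes).
 * Returns 0 on success, -1 if the pattern does not match or fails to compile. */
int extract_hex_regex(const char *path, char out[9])
{
    /* Four capture groups, one per two-digit path component. */
    static const char *pattern =
        "/(8[0-9A-Fa-f])/([0-9A-Fa-f]{2})/([0-9A-Fa-f]{2})/([0-9A-Fa-f]{2})\\.";
    regex_t re;
    regmatch_t m[5];   /* m[0] is the whole match, m[1..4] are the groups */
    int rc = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, path, 5, m, 0) == 0) {
        /* Each group is exactly two characters long by construction. */
        for (int i = 1; i <= 4; i++)
            memcpy(out + 2 * (i - 1), path + m[i].rm_so, 2);
        out[8] = '\0';
        rc = 0;
    }

    regfree(&re);
    return rc;
}
```

Calling extract_hex_regex("/my/complete/path/to/80/01/a9/1d.pdf", buf) should leave "8001a91d" in buf. With 250,000 paths you would probably want to call regcomp once up front and reuse the compiled regex_t rather than recompiling per path as this sketch does.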

BTW, prefer memcpy to strncpy. Very few people know what strncpy is good for. And I'm not one of them.
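To make the memcpy point concrete, here is a small sketch. The helper names copy_pair and strncpy_surprises are invented for this example. It shows a fixed-length copy with explicit termination, plus the two standard strncpy behaviours that tend to surprise people: no terminator is written when the source is at least n characters long, and the destination is zero-padded when the source is shorter.

```c
#include <string.h>

/* Copies exactly two characters from src into dst and terminates dst.
 * dst must have room for 3 bytes; src must have at least 2 readable bytes. */
void copy_pair(char dst[3], const char *src)
{
    memcpy(dst, src, 2);  /* fixed-length copy, no scanning for '\0' */
    dst[2] = '\0';        /* termination is explicit */
}

/* Returns 1 if both strncpy behaviours described above are observed. */
int strncpy_surprises(void)
{
    char a[4] = { 'x', 'x', 'x', 'x' };
    char b[4] = { 'x', 'x', 'x', 'x' };

    strncpy(a, "abcdef", 2);  /* source longer than n: no '\0' is written */
    strncpy(b, "a", 4);       /* source shorter than n: the rest is zero-padded */

    return a[2] == 'x' && b[1] == '\0' && b[3] == '\0';
}
```

In the question's code the length is always known to be 2, so strncpy never gets a chance to terminate anything and memcpy states the intent more directly. Some compilers may warn about the deliberately truncating strncpy call above.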

In the simple case of just matching /8./../../.., I'd personally go for the strstr() solution (no external dependency required). If the rules become more complex, though, you could try a lexer (flex and friends); they support regular expressions.

In your case something like this:

h2           [0-9A-Fa-f]{2}
mymatch      (/{h2}){4}

could work. You'd have to set buffers to the match by side effect, though, as lexers typically return token identifiers.

Anyway, you'd gain the power of regexps without the dependencies, but at the expense of generated (read: unreadable) code.

/* original argument */
char *path = "/my/complete/path/to/80/01/a9/1d.pdf";
char *begin;
char hex[9];
size_t len;

/* find "pattern" */
begin = strstr(path, "/8");
if (!begin) return 1;

/* sanity check: "/hh/hh/hh/hh" needs at least 12 characters */
len = strlen(begin);
if (len < 12) return 2;

/* more sanity: the separators must be where we expect them */
if (begin[3] != '/' || begin[6] != '/' || begin[9] != '/') return 3;

memcpy(hex,   begin+1,  2);
memcpy(hex+2, begin+4,  2);
memcpy(hex+4, begin+7,  2);
memcpy(hex+6, begin+10, 2);
hex[8] = 0;

/* For additional sanity, you could check for valid hex characters here */

/* print final char array */
printf("%s\n", hex);
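The hex check mentioned in the comment could look roughly like the sketch below. It also deals with one trap of the strstr() approach: strstr() returns the first "/8" in the path, which may belong to an unrelated directory such as "/8folder", so this version keeps searching instead of giving up. The function name extract_hex and the 0 / -1 return convention are assumptions made for this example.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: extracts "8001a91d" from ".../80/01/a9/1d.pdf".
 * out must hold 9 bytes. Returns 0 on success, -1 if no candidate fits. */
int extract_hex(const char *path, char out[9])
{
    const char *p = path;

    /* Try every "/8" in turn, since the first one may be an unrelated directory. */
    while ((p = strstr(p, "/8")) != NULL) {
        /* "/hh/hh/hh/hh" is 12 characters, with '/' at offsets 3, 6 and 9. */
        if (strlen(p) >= 12 &&
            p[3] == '/' && p[6] == '/' && p[9] == '/') {
            int ok = 1;
            for (int i = 0; i < 4 && ok; i++) {
                const char *pair = p + 1 + 3 * i;
                /* Cast to unsigned char: isxdigit is undefined for negative values. */
                ok = isxdigit((unsigned char)pair[0]) &&
                     isxdigit((unsigned char)pair[1]);
            }
            if (ok) {
                for (int i = 0; i < 4; i++)
                    memcpy(out + 2 * i, p + 1 + 3 * i, 2);
                out[8] = '\0';
                return 0;
            }
        }
        p++;  /* resume the search just past this "/8" */
    }
    return -1;
}
```

With this, a path like "/my/8folder/80/01/a9/1d.pdf" should still yield "8001a91d", while "/80/zz/a9/1d.pdf" is rejected because "zz" is not hex.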
