
strstr vs regex in c


Let's say, for example, I have user IDs, access times, program names, and version numbers stored as a list of CSV strings, like this:

1,1342995305,Some Program,0.98
1,1342995315,Some Program,1.20
2,1342985305,Another Program,15.8.3
1,1342995443,Bob's favorite game,0.98
3,1238543846,Something else,
...

Assume this list is not a file, but an in-memory list of strings.

Now let's say I want to find out how often certain programs have been accessed, broken down by version number (e.g. "Some Program version 1.20" was accessed 193 times, "Some Program version 0.98" was accessed 876 times, and "Some Program 1.0.1" was accessed 1,932 times).

Would it be better to build a regular expression and use regexec() to find the matches and pull out the version numbers, or to use strstr() to match the program name plus a comma and then read the rest of the string as the version number? If it makes a difference, assume I am using GCC on Linux.

Is there a performance difference? Is one method "better" or "more proper" than the other? Does it matter at all?

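Use strstr(). Counting occurrences with a regular expression is not a good idea, because you still need a loop anyway. I suggest a simple loop that searches for the position of the substring, increments a counter, and restarts the search just past each match.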
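The loop described above could look something like this. It is only a minimal sketch: `count_occurrences` is a hypothetical helper name, and it assumes the records have been joined into one NUL-terminated buffer (for a list of separate strings, call it once per string and add up the results).

```c
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of needle in haystack.
 * count_occurrences is a hypothetical helper, not a libc function. */
size_t count_occurrences(const char *haystack, const char *needle)
{
    size_t needle_len = strlen(needle);
    size_t count = 0;

    if (needle_len == 0)
        return 0;   /* an empty needle would match at every position */

    /* Restart each search just past the previous match. */
    for (const char *p = strstr(haystack, needle); p != NULL;
         p = strstr(p + needle_len, needle))
        count++;

    return count;
}
```

Note that a plain substring search for `"Some Program,"` would also match a record whose program is `"Awesome Program"`, so the needle may need a leading comma as well.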

strchr/memcmp is how most libc versions implemented strstr. Hardware-dependent implementations of strstr in glibc do better. Both the SSE2 and SSE4.2 (x86) instruction sets can do way better than scanning byte by byte. If you want to see how, I posted a couple of blog articles a while back ("SSE2 and strstr" and "SSE2 and BNDM search") that you might find interesting.

Use strtok() and break the data up into something more structured (such as a list of structs).
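As a rough sketch of that idea, assuming the four-field layout from the question: `struct access_record`, `parse_record`, and the buffer sizes are all made up for illustration. Since strtok() writes into the string it scans, the line is copied first.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type; field sizes are arbitrary assumptions. */
struct access_record {
    long user_id;
    long timestamp;
    char program[64];
    char version[32];
};

/* Parse one "id,time,program,version" line into *out.
 * Returns 0 on success, -1 if the line is too long or a required
 * field (id, time, program) is missing. */
int parse_record(const char *line, struct access_record *out)
{
    char buf[256];
    char *id, *ts, *prog, *ver;

    if (strlen(line) >= sizeof buf)
        return -1;
    strcpy(buf, line);          /* strtok() modifies its input */

    id   = strtok(buf, ",");
    ts   = strtok(NULL, ",");
    prog = strtok(NULL, ",");
    ver  = strtok(NULL, ",");   /* NULL when the version field is empty */
    if (id == NULL || ts == NULL || prog == NULL)
        return -1;

    out->user_id   = strtol(id, NULL, 10);
    out->timestamp = strtol(ts, NULL, 10);
    snprintf(out->program, sizeof out->program, "%s", prog);
    snprintf(out->version, sizeof out->version, "%s", ver ? ver : "");
    return 0;
}
```

One caveat: strtok() treats consecutive delimiters as one, so an empty field in the middle of a line would shift the later fields. That is harmless here only because the one field that can be empty (the version) comes last.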

I'd do neither: I'm betting it would be faster to use strchr() to find the commas, and strcmp() to check the program name.
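A minimal sketch of that approach, assuming the four-field layout from the question. `match_program` is a hypothetical helper name. Because the program field ends at a comma instead of a NUL, the sketch swaps in a length check plus memcmp() where strcmp() was suggested.

```c
#include <stddef.h>
#include <string.h>

/* If line has the form "id,time,program,version" and its program field
 * equals `program` exactly, return a pointer to the version text (which
 * may be an empty string); otherwise return NULL.
 * match_program is a hypothetical helper, not a libc function. */
const char *match_program(const char *line, const char *program)
{
    const char *c1, *c2, *c3;
    size_t plen = strlen(program);

    c1 = strchr(line, ',');         /* end of the user id */
    if (c1 == NULL)
        return NULL;
    c2 = strchr(c1 + 1, ',');       /* end of the access time */
    if (c2 == NULL)
        return NULL;
    c3 = strchr(c2 + 1, ',');       /* end of the program name */
    if (c3 == NULL)
        return NULL;

    /* Compare the whole field, so "Awesome Program" != "Some Program". */
    if ((size_t)(c3 - (c2 + 1)) != plen || memcmp(c2 + 1, program, plen) != 0)
        return NULL;

    return c3 + 1;                  /* version starts after the 3rd comma */
}
```

The returned pointer aliases the original line, so nothing is copied or modified.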

As for performance, I expect the string functions (strtok / strstr / strchr / strcmp ...) all to run at more or less the same speed (i.e. really, really fast), and regex to run appreciably slower, albeit still quite fast.
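For comparison, here is roughly what the regexec() route from the question involves, using the POSIX `<regex.h>` interface. The pattern and the helper names `compile_record_regex` and `regex_fields` are assumptions for illustration. The pattern is compiled once and reused, since recompiling it per line would add avoidable cost.

```c
#include <regex.h>
#include <stdio.h>

/* Compile a pattern for "id,time,program,version": group 1 captures the
 * program name, group 2 the version. Returns regcomp()'s result, which
 * is 0 on success. Call regfree() on re when done. */
int compile_record_regex(regex_t *re)
{
    return regcomp(re, "^[^,]*,[^,]*,([^,]*),(.*)$", REG_EXTENDED);
}

/* Copy the program and version fields of line into the caller's buffers
 * (truncating if they are too small). Returns 0 on a match, -1 otherwise. */
int regex_fields(const regex_t *re, const char *line,
                 char *program, size_t psize,
                 char *version, size_t vsize)
{
    regmatch_t m[3];    /* m[0] = whole match, m[1] and m[2] = groups */

    if (regexec(re, line, 3, m, 0) != 0)
        return -1;

    snprintf(program, psize, "%.*s",
             (int)(m[1].rm_eo - m[1].rm_so), line + m[1].rm_so);
    snprintf(version, vsize, "%.*s",
             (int)(m[2].rm_eo - m[2].rm_so), line + m[2].rm_so);
    return 0;
}
```

Whether this is measurably slower than the plain string functions on a given data set is something to benchmark, not assume.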

The real performance benefit would come from properly designing the search, though: how many times must it run? Is the number of programs fixed?

For example, a single scan that collects ALL the frequency data for all the programs would be much slower than a single scan looking for one given program. But, properly designed, all subsequent queries for other programs would run way faster.
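One way to picture that trade-off is a single pass that tallies every distinct "program,version" pair, after which each query is just a lookup. The sketch below is illustrative only: `freq_table`, `build_freq_table`, `lookup_count`, and the fixed sizes are all assumptions, and a real implementation with many distinct keys would probably use a hash table in place of the linear search.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_KEYS 256    /* arbitrary capacity for this sketch */

struct freq_entry {
    char key[96];       /* "program,version": the line after the 2nd comma */
    unsigned count;
};

struct freq_table {
    struct freq_entry entries[MAX_KEYS];
    size_t used;
};

/* One pass over all lines, counting each distinct "program,version" pair.
 * Returns the number of lines skipped (malformed, too long, or table full). */
size_t build_freq_table(struct freq_table *t,
                        const char *const *lines, size_t n)
{
    size_t skipped = 0;

    t->used = 0;
    for (size_t i = 0; i < n; i++) {
        const char *c1 = strchr(lines[i], ',');
        const char *c2 = c1 ? strchr(c1 + 1, ',') : NULL;
        const char *key = c2 ? c2 + 1 : NULL;
        size_t j;

        if (key == NULL || strlen(key) >= sizeof t->entries[0].key) {
            skipped++;
            continue;
        }
        for (j = 0; j < t->used; j++)       /* linear search for the key */
            if (strcmp(t->entries[j].key, key) == 0)
                break;
        if (j == t->used) {                 /* first time this key is seen */
            if (t->used == MAX_KEYS) {
                skipped++;
                continue;
            }
            strcpy(t->entries[j].key, key);
            t->entries[j].count = 0;
            t->used++;
        }
        t->entries[j].count++;
    }
    return skipped;
}

/* Later queries consult the table; the lines are not rescanned. */
unsigned lookup_count(const struct freq_table *t,
                      const char *program, const char *version)
{
    char key[96];
    int len = snprintf(key, sizeof key, "%s,%s", program, version);

    if (len < 0 || (size_t)len >= sizeof key)
        return 0;
    for (size_t j = 0; j < t->used; j++)
        if (strcmp(t->entries[j].key, key) == 0)
            return t->entries[j].count;
    return 0;
}
```

If only one program is ever queried, and only once, the direct scan is the simpler choice. The table pays off when the same list is queried repeatedly.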
