
Tokenize a String in C with strtok (Include Digits as Delimiters)

So I have the following function:

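#include <stdio.h>
#include <string.h>

void tokenize(void) {
    char *word;
    char text[] = "Some - text, from stdin. We'll see! what happens? 4ND 1F W3 H4V3 NUM83R5?!?";
    int nbr_words = 0;

    /* The first call passes the buffer; strtok writes '\0' over the delimiter that ends each token. */
    word = strtok(text, " ,.-!?()");

    while (word != NULL) {
        printf("%s\n", word);
        nbr_words += 1;
        /* Later calls pass NULL to continue scanning the same buffer. */
        word = strtok(NULL, " ,.-!?()");
    }

    printf("\n%d words\n", nbr_words);
}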

And the output is:

Some
text
from
stdin
We'll
see
what
happens
4ND
1F
W3
H4V3
NUM83R5


13 words

Basically, what I'm doing is tokenizing paragraphs of text into words for further analysis down the road. I have my text, and I have my delimiters. The only problem is treating digits as delimiters along with all the rest. I know that I can use isdigit from ctype.h. However, I don't know how to include it in the strtok call.

For example (obviously wrong): strtok(paragraph, " ,.-!?()isdigit()");

Something along those lines. But since I have each token (word) at this stage, is there some kind of post-processing if statement I could use to further tokenize each word, splitting at digits?

For example, the tokens containing digits would break down further into:

ND
F
W
H
V
NUM
R

15 words // updated counter to include new tokens
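
The post-processing idea from the question can be sketched with isdigit directly. This is only an illustration: split_at_digits is a hypothetical helper name, not part of any library, and it assumes the caller supplies a writable word (such as a token returned by strtok) and an output array of pointers.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: splits `word` in place at every digit and stores
 * pointers to the non-digit pieces in `out` (at most `max` of them).
 * Returns the number of pieces stored. */
size_t split_at_digits(char *word, char **out, size_t max)
{
    size_t n = 0;
    char *p = word;

    while (*p != '\0') {
        /* Skip any run of digits; they act as delimiters. */
        while (isdigit((unsigned char)*p))
            p++;
        if (*p == '\0' || n == max)
            break;

        out[n++] = p;  /* start of a non-digit piece */

        /* Advance to the end of the piece. */
        while (*p != '\0' && !isdigit((unsigned char)*p))
            p++;

        /* Terminate the piece if it was ended by a digit. */
        if (*p != '\0')
            *p++ = '\0';
    }
    return n;
}
```

Calling this on each token inside the while loop and adding the return value to nbr_words (instead of adding 1) would give the updated count. The cast to unsigned char is there because passing a negative char value to isdigit is undefined behavior.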

strtok is very simple in this respect: just list all the digits as delimiters, one by one, like this:

strtok(paragraph, " ,.-!?()0123456789");
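
Put into a complete function, that might look like the sketch below. The name count_words is made up for illustration; the function prints each token and returns how many it found, and it assumes the caller passes a writable buffer, since strtok modifies its input.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints every token in `text` and returns the count.
 * Spaces, punctuation, and digits are all treated as delimiters.
 * Modifies `text` in place. */
int count_words(char *text)
{
    static const char delims[] = " ,.-!?()0123456789";
    int n = 0;

    for (char *word = strtok(text, delims);
         word != NULL;
         word = strtok(NULL, delims)) {
        printf("%s\n", word);
        n++;
    }
    return n;
}
```

With the sample text from the question, this returns 15, matching the updated counter the question asks for.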

Note: strtok is an old, non-reentrant function that should not be used in modern programs. You should switch to strtok_r (a POSIX function), which has a similar interface but can be used in concurrent environments and other situations where you need reentrancy.
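
One situation where reentrancy matters is nested tokenizing, which also covers the "split each word again" idea from the question. The following is a minimal sketch, assuming a POSIX system where strtok_r is available; tokenize_nested is a hypothetical name chosen for this example.

```c
#define _POSIX_C_SOURCE 200809L  /* asks the headers to expose strtok_r */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits `text` into words on spaces and punctuation,
 * then splits each word again at digits. Prints every piece and returns
 * the total number of pieces. Modifies `text` in place. */
int tokenize_nested(char *text)
{
    char *outer_state;
    char *inner_state;
    int n = 0;

    for (char *word = strtok_r(text, " ,.-!?()", &outer_state);
         word != NULL;
         word = strtok_r(NULL, " ,.-!?()", &outer_state)) {
        /* The inner loop keeps its own state, so the outer scan is not disturbed. */
        for (char *piece = strtok_r(word, "0123456789", &inner_state);
             piece != NULL;
             piece = strtok_r(NULL, "0123456789", &inner_state)) {
            printf("%s\n", piece);
            n++;
        }
    }
    return n;
}
```

The same nesting with plain strtok would not work, because the inner calls would overwrite the single hidden position that the outer loop depends on.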

Why not just use:

    word = strtok(text, " ,.-!?()1234567890");
