
Tokenize a String in C with strtok (Include Digits as Delimiters)

So I have the following function:

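#include <stdio.h>
#include <string.h>

void tokenize(void) {
    char *word;
    char text[] = "Some - text, from stdin. We'll see! what happens? 4ND 1F W3 H4V3 NUM83R5?!?";
    int nbr_words = 0;

    word = strtok(text, " ,.-!?()");

    while (word != NULL) {
        printf("%s\n", word);
        word = strtok(NULL, " ,.-!?()");
        nbr_words += 1;
    }

    /* Print the word count that appears at the end of the output. */
    printf("\n\n%d words\n", nbr_words);
}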

And the output is:

Some
text
from
stdin
We'll
see
what
happens
4ND
1F
W3
H4V3
NUM83R5


13 words

Basically, I'm tokenizing paragraphs of text into words for further analysis down the road. I have my text, and I have my delimiters. The only problem is treating digits as delimiters along with all the rest. I know that I can use isdigit from ctype.h. However, I don't know how to combine it with strtok.

For example (obviously wrong): strtok(paragraph, " ,.-!?()isdigit()");

Something along those lines. But since I have each token (word) at this stage, is there some kind of post-processing if statement I could use to further tokenize each word, splitting at digits?

For example, the last few tokens would be broken down further into:

ND
F
W
H
V
NUM
R

15 words // updated counter to include new tokens

strtok makes this simple: its second argument is just a set of single-character delimiters, so list every digit in it, one by one, like this:

strtok(paragraph, " ,.-!?()0123456789");
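
As a minimal sketch of that call in context, the question's loop could be wrapped in a function that uses the extended delimiter set. The name tokenize_words and its return value are assumptions made for illustration; they are not part of the original code.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the original post): splits text in place,
   prints each token on its own line, and returns the number of tokens. */
int tokenize_words(char *text)
{
    /* The question's punctuation set, plus every digit listed explicitly. */
    const char *delims = " ,.-!?()0123456789";
    int nbr_words = 0;

    for (char *word = strtok(text, delims);
         word != NULL;
         word = strtok(NULL, delims)) {
        printf("%s\n", word);
        nbr_words += 1;
    }
    return nbr_words;
}
```

Because strtok writes terminators into its input, the argument must be a modifiable array such as the question's char text[], not a string literal. Called on the question's text, this should print the 15 tokens the asker expects and return 15.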

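Note: strtok keeps its scan position in hidden static state, so it is not reentrant and should not be relied on in multithreaded code; two interleaved tokenizations will also corrupt each other. Where it is available, prefer strtok_r, which is specified by POSIX rather than ISO C. It has a similar interface but stores the position in a caller-supplied pointer, so it can be used in concurrent environments and other situations where you need reentrancy.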
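
To illustrate the reentrancy point, here is a rough sketch of the two-stage "post-processing" idea from the question, written with strtok_r: an outer pass splits on punctuation and an inner pass splits each word at digits. Nesting like this is only safe because each pass has its own state pointer. The function name tokenize_nested is hypothetical, and the feature-test macro is an assumption about building with a strict ISO C mode on a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L  /* expose strtok_r under strict -std=c11 */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits text into words, then each word at digits.
   Prints every digit-free piece and returns how many were printed. */
int tokenize_nested(char *text)
{
    char *outer_state;  /* scan position of the outer (word) pass */
    char *inner_state;  /* scan position of the inner (digit) pass */
    int nbr_words = 0;

    for (char *word = strtok_r(text, " ,.-!?()", &outer_state);
         word != NULL;
         word = strtok_r(NULL, " ,.-!?()", &outer_state)) {
        /* The inner pass only touches bytes inside the current word. */
        for (char *piece = strtok_r(word, "0123456789", &inner_state);
             piece != NULL;
             piece = strtok_r(NULL, "0123456789", &inner_state)) {
            printf("%s\n", piece);
            nbr_words += 1;
        }
    }
    return nbr_words;
}
```

With plain strtok the inner call would overwrite the outer call's saved position, so the outer loop would stop after the first word. For this particular problem the single-pass delimiter list is simpler; the nested form is mainly useful when the two levels need different handling.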

Why not just use

    word = strtok(text, " ,.-!?()1234567890");
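
If you would rather use isdigit, as the question suggests, strtok cannot take a predicate, so the scan has to be written by hand. The following is only a sketch under that assumption; is_delim and tokenize_isdigit are hypothetical names, and unlike strtok this version does not modify its input.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical predicate: true for the question's punctuation or any digit.
   The '\0' check matters because strchr would otherwise match the terminator. */
static int is_delim(unsigned char c)
{
    return isdigit(c) || (c != '\0' && strchr(" ,.-!?()", c) != NULL);
}

/* Hypothetical helper: prints each token and returns the token count. */
int tokenize_isdigit(const char *text)
{
    int nbr_words = 0;
    const char *p = text;

    while (*p != '\0') {
        /* Skip a run of delimiters. */
        while (*p != '\0' && is_delim((unsigned char)*p))
            p++;

        /* Measure the token that follows, if any. */
        const char *start = p;
        while (*p != '\0' && !is_delim((unsigned char)*p))
            p++;

        if (p > start) {
            /* Print exactly the token's bytes without copying them. */
            printf("%.*s\n", (int)(p - start), start);
            nbr_words += 1;
        }
    }
    return nbr_words;
}
```

The cast to unsigned char is there because passing a negative char value to isdigit is undefined behavior.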
