
Reading String token by token in C

I'm trying to build an LL(1) recursive descent parser in C for a specific grammar I've been given. I have a general idea of how to do this recursively, but one issue is stopping me from even starting the implementation, and I'm sure it comes from not being very familiar with C. Basically, I need to be able to read a string such as "(1+2)*3" token by token. For that string, I need to first read the "(", and then, further down the recursive process, I'd call something like nextToken(), which would give me the "1".

That said, I would probably only ever need to read the very first token of the string each time I call nextToken(), because after grabbing the value I'd change the string to be what it was before, minus the token just read. For example, I start with "(1+2)*3", then I call nextToken() on the string, which gives me the "(" and leaves the string as "1+2)*3".

My issue is that I don't know how to do this in C.

That's what a "lexer" does, typically as a stage before the parser. Probably the best you can do is try lex (most likely flex, as in Flex & Bison). (It's true that the lexer's work can also be done entirely in the parser, but that's probably much messier.)

A less preferable way would be to categorize all the possible tokens and write regular expressions that match a valid prefix of the input (which is what lex does under the hood).

In C, a "string" is just a region of memory containing characters, which is terminated by the first NUL (0) character. That being the case, all you need for a string is a pointer to the first character. (That means that the length of the string needs to be computed, so try to avoid doing that more often than is necessary.)
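As a small illustration of the point about computing the length, here is a minimal sketch. The helper name count_char is hypothetical and only exists for this example. It calls strlen once, before the loop, instead of in the loop condition, where it would rescan the string on every iteration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for illustration: counts how often c occurs in s.
 * strlen has to walk the string until it finds the NUL, so we call it
 * once up front rather than on every pass through the loop.
 */
size_t count_char(const char *s, char c) {
  size_t len = strlen(s);   /* one scan to find the terminating NUL */
  size_t count = 0;
  for (size_t i = 0; i < len; ++i) {
    if (s[i] == c) ++count;
  }
  return count;
}
```

Walking a pointer until it reaches the NUL would avoid the separate length computation altogether; which form reads better is mostly a matter of taste.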

There are standard library functions which can do things like compare strings and copy strings, but it is important to remember that memory management of strings is your responsibility.

While this may seem primitive, error-prone, and complicated to those used to languages in which strings are actual datatypes, it is how it is. If you're planning on doing string manipulation in C, you need to get used to it.

Nonetheless, string manipulation in C can be both efficient and trouble-free, as long as you follow the rules. For example, if you want to refer to the substring of s starting at the 3rd character, you can just use pointer arithmetic: s + 2. If you want to (temporarily) create a substring at a given point in a string, you can drop a 0 into the string at the end of the substring, and then later restore the character that was there. (That's how a lexical scanner built with (f)lex works. The standard library function strtok uses the same overwrite trick, although it leaves the 0 in place rather than restoring the original character.) Note that this strategy requires that the character array be mutable, so you won't be able to apply it to string literals. (Character arrays initialized from a literal are fine, though, since they are mutable.)
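The two techniques just described can be sketched roughly as follows. The helper substring_equals is a hypothetical name invented for this illustration, and it assumes the caller passes a mutable buffer and a range that lies inside the string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for illustration: reports whether the len
 * characters starting at buf + start are equal to word.
 * Assumes buf is mutable and start + len does not exceed strlen(buf).
 */
int substring_equals(char *buf, size_t start, size_t len, const char *word) {
  char *sub = buf + start;   /* pointer arithmetic: no copy is made */
  char saved = sub[len];     /* remember the character we overwrite */
  sub[len] = '\0';           /* temporarily terminate the substring */
  int equal = strcmp(sub, word) == 0;
  sub[len] = saved;          /* restore the original character */
  return equal;
}
```

With char text[] = "(1+2)*3"; the call substring_equals(text, 1, 3, "1+2") compares only the middle of the buffer and leaves text unchanged afterwards. Passing a string literal instead of an array would be undefined behavior, because the function writes into the buffer.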

It's quite possible that your best bet for building a lexical scanner would be to use flex. The scanner which flex builds will do a lot of things for you, including input buffering, and flex lets you specify regular expressions instead of hand coding them.

But if you want to do it by hand, it is not that hard, particularly if the entire input is in memory so that buffering is not necessary. (If no token spans a line, you could also read the input a line at a time, but that's not as efficient as reading fixed-length blocks, which is what the flex scanner will do.)

Here, for example, is a simple scanner which handles arithmetic operators, integers, and identifiers. It does not use the "overwrite with NUL" strategy, so it can be used with string literals. For identifiers, it creates a newly-allocated string, so the caller needs to free the identifier when it is no longer needed. (No garbage collection. C'est la vie.) The token is "returned" through a pointer argument (an out-parameter); the actual return value of the function is a pointer to the remainder of the source string. Quite a lot of error checking has been omitted.

#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* The type of a single-character operator is the character itself,
 * so other token types need to start at 256. We use 0 as the token
 * type that indicates the end of input.
 */
enum TokenType { NUMBER = 256, ID };
typedef struct Token {
  enum TokenType token_type;
  union { /* Anonymous unions are a C11 feature. */
    long      number;  /* Only valid if type is NUMBER */
    char*     id;      /* Only valid if type is ID */
  }; 
} Token;

/* You would normally call this like this:
 * do {
 *   s = next_token(s, &token);
 *   // Do something with token
 * } while (token.token_type);
 */
const char* next_token(const char* input, Token* out) {
  /* Skip whitespace */
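  /* The <ctype.h> functions require a value representable as
   * unsigned char (or EOF), so cast before calling them.
   */
  while (isspace((unsigned char)*input)) ++input;
  if (isdigit((unsigned char)*input)) {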
    char* lim;
    out->number = strtol(input, &lim, 10);
    out->token_type = NUMBER;
    return lim;
  } else if (isalpha((unsigned char)*input)) {
    const char* lim = input + 1;
    /* Find the end of the id */
    while (isalnum((unsigned char)*lim)) ++lim;
    /* Allocate enough memory to copy the id. We need one extra byte
     * for the NUL
     */
    size_t len = lim - input;
    out->id = malloc(len + 1);
    memcpy(out->id, input, len);
    out->id[len] = 0;  /* NUL-terminate the string */
    out->token_type = ID;
    return lim;
  } else {
    out->token_type = *input;
    /* If we hit the end of the input string, we don't advance the
     * input pointer, to avoid reading random memory.
     */
    return *input ? input + 1 : input;
  }
}
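To connect this back to the original question, here is a rough sketch of how a scanner of this kind might drive a recursive descent parser with one token of lookahead. Everything in it is an assumption made for illustration: the grammar is a guess (the question does not show the actual grammar), the names Parser, advance, parse_expr, parse_term, parse_factor and evaluate are hypothetical, and the scanner is a cut-down variant of next_token that handles only integers and single-character operators so that the example stays self-contained.

```c
#include <ctype.h>
#include <stdlib.h>

/* Assumed grammar (hypothetical, not the asker's actual grammar):
 *   expr   -> term   { ('+' | '-') term }
 *   term   -> factor { ('*' | '/') factor }
 *   factor -> NUMBER | '(' expr ')'
 */
enum { TOK_NUMBER = 256 };   /* 0 marks the end of input */

typedef struct Parser {
  const char *rest;   /* unread remainder of the input */
  int type;           /* lookahead: operator character, TOK_NUMBER, or 0 */
  long value;         /* only valid when type is TOK_NUMBER */
} Parser;

/* Reads the next token into the lookahead slot and advances p->rest. */
static void advance(Parser *p) {
  while (isspace((unsigned char)*p->rest)) ++p->rest;
  if (isdigit((unsigned char)*p->rest)) {
    char *lim;
    p->value = strtol(p->rest, &lim, 10);
    p->type = TOK_NUMBER;
    p->rest = lim;
  } else {
    p->type = (unsigned char)*p->rest;
    if (*p->rest) ++p->rest;   /* never step past the terminating NUL */
  }
}

static long parse_expr(Parser *p);

static long parse_factor(Parser *p) {
  if (p->type == TOK_NUMBER) {
    long v = p->value;
    advance(p);
    return v;
  }
  if (p->type == '(') {
    advance(p);
    long v = parse_expr(p);
    if (p->type == ')') advance(p);   /* a real parser would report an error otherwise */
    return v;
  }
  return 0;   /* unexpected token: error handling omitted in this sketch */
}

static long parse_term(Parser *p) {
  long v = parse_factor(p);
  while (p->type == '*' || p->type == '/') {
    int op = p->type;
    advance(p);
    long rhs = parse_factor(p);
    if (op == '*') v *= rhs;
    else if (rhs != 0) v /= rhs;   /* division by zero is silently skipped here */
  }
  return v;
}

static long parse_expr(Parser *p) {
  long v = parse_term(p);
  while (p->type == '+' || p->type == '-') {
    int op = p->type;
    advance(p);
    long rhs = parse_term(p);
    v = (op == '+') ? v + rhs : v - rhs;
  }
  return v;
}

/* Evaluates an arithmetic expression held in a NUL-terminated string. */
long evaluate(const char *input) {
  Parser p = { input, 0, 0 };
  advance(&p);   /* prime the one-token lookahead */
  return parse_expr(&p);
}
```

Each parsing function decides what to do by inspecting only the current lookahead token, which is what makes the approach LL(1). Instead of altering the input string after each token, the parser just keeps a pointer to the unread remainder, so it also works on string literals.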
