Tokenize elements from a text file by removing comments, extra spaces and blank lines in C++

I'm trying to remove comments, blank lines and extra spaces from a text file, then tokenize what is left. Each token needs a space before and after it.

exampleFile.txt
var

/* declare variables */a1 ,
b2a ,     c,

Here's what's working so far:

string line; // holds the whole file: with '\0' as the delimiter, getline reads to end of file
ifstream InputFile("exampleFile.txt", ios::in); // read from exampleFile.txt

// Remove comments
while (InputFile && getline(InputFile, line, '\0'))
{
    size_t Begin;
    while ((Begin = line.find("/*")) != string::npos)
    {
        size_t End = line.find("*/", Begin + 2);
        if (End == string::npos)
        {
            line.erase(Begin); // unterminated comment: erase to the end
            break;
        }
        line.erase(Begin, End - Begin + 2); // erase from /* through */
    }
}

This removes comments, but I can't seem to figure out a way to tokenize while this is happening.

So my questions are:

  • Is it possible to remove comments, spaces, and empty lines and tokenize all in this while statement?
  • How can I implement a function to add spaces in between each token before they are tokenized? Tokens like c, need to be recognized as c and , individually.

Thank you in advance for the help!

If you need to skip whitespace characters and you don't care about newlines, then I'd recommend reading the file with operator>>. You could simply write:

std::string word;
bool isComment = false; // true while we are between /* and */
while (file >> word) // file is an already opened std::ifstream
{
    if (isInsideComment(word, isComment))
        continue;

    // do processing of the tokens here
    std::cout << word << std::endl;
}

Where the helper function could be implemented as follows:

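#include <algorithm> // std::equal
#include <string>

// Returns true if word lies inside a comment and should be skipped.
// May trim a comment marker off word; isComment carries state between calls.
bool isInsideComment(std::string &word, bool &isComment)
{
    const std::string tagStart = "/*";
    const std::string tagStop = "*/";

    // match start marker: word ends with tagStart
    // (the size check keeps std::equal from reading outside a short word such as ",")
    if (word.size() >= tagStart.size() &&
        std::equal(tagStart.rbegin(), tagStart.rend(), word.rbegin()))
    {
        isComment = true;
        // keep whatever precedes the marker; skip the word if nothing is left
        word = word.substr(0, word.find(tagStart));
        return word.empty();
    }

    // match end marker
    if (isComment)
    {
        // word starts with tagStop
        if (word.size() >= tagStop.size() &&
            std::equal(tagStop.begin(), tagStop.end(), word.begin()))
        {
            isComment = false;
            // keep whatever follows the marker; skip the word if nothing is left
            word = word.substr(tagStop.size());
            return word.empty();
        }

        return true;
    }

    return false;
}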

For your example this would print out:

var
a1
,
b2a
,
c,

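Because the isComment flag persists from one word to the next and operator>> treats newlines like any other whitespace, the same logic also handles comments that span multiple lines.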

However, note that the function has to be adapted to your assumptions about the comment markers. For instance, are they always separated from other words by whitespace? Or could an expression such as var1/*comment*/var2 appear in the input? The function above won't handle that case.
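To see why, here is a small sketch that splits text on whitespace the same way operator>> does. The name splitOnWhitespace is a hypothetical helper introduced only for this illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collect the words that operator>> would extract.
std::vector<std::string> splitOnWhitespace(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) // skips leading whitespace, stops at the next whitespace
        words.push_back(word);
    return words;
}
```

Called with "var1/*comment*/var2" it returns a single word, so that word neither ends with /* nor starts with */, and isInsideComment passes it through unchanged. Called with "var1 /* comment */ var2" it returns five words, which is the layout the function expects.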

Hence, another option (which you have already started implementing) is to read whole lines, or even larger chunks of the file, so that the opening and closing comment markers end up in the same string. You can then locate the markers with find or a regex and remove everything between them.
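A minimal sketch of that second approach, assuming the whole file has already been read into one string (as the getline call with '\0' in the question does). The names stripComments and tokenize are hypothetical, not taken from the question or from any library. Two assumptions are built in: each comment is replaced by a single space, so that var1/*comment*/var2 yields two words, and a token is either a run of letters, digits, or underscores, or any other single non-space character.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Replace every /* ... */ comment with a single space.
// Assumption: an unterminated comment runs to the end of the input.
std::string stripComments(std::string text)
{
    std::size_t begin = 0;
    while ((begin = text.find("/*", begin)) != std::string::npos)
    {
        const std::size_t end = text.find("*/", begin + 2);
        if (end == std::string::npos)
        {
            text.erase(begin); // no closing marker: drop the rest
            break;
        }
        text.replace(begin, end + 2 - begin, " ");
    }
    return text;
}

// Split text into tokens. Whitespace (including blank lines) is discarded,
// a run of letters, digits, or underscores is one token, and any other
// character, such as ',', is a token by itself.
std::vector<std::string> tokenize(const std::string& text)
{
    auto isWordChar = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };

    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size())
    {
        if (std::isspace(static_cast<unsigned char>(text[i])))
        {
            ++i; // skip spaces, tabs, and newlines
        }
        else if (isWordChar(text[i]))
        {
            std::size_t j = i;
            while (j < text.size() && isWordChar(text[j]))
                ++j;
            tokens.push_back(text.substr(i, j - i));
            i = j;
        }
        else
        {
            tokens.push_back(std::string(1, text[i])); // punctuation
            ++i;
        }
    }
    return tokens;
}
```

For the example file, tokenize(stripComments(contents)) yields var, a1, a comma, b2a, a comma, c, and a final comma, so c, is split into two tokens. Printing each token followed by a space then gives the spacing the question asks for.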
