
Removing C++ Comments From Source Code

I have some C++ code with /* */ and // style comments, and I want a way to remove them all automatically. At first glance, an editor (e.g. UltraEdit) with a regular-expression search for /*, */ and // should do the job. On closer inspection, a complete solution isn't that simple, because the sequences /* and // do not start a comment when they appear inside another comment, a string literal or a character literal. For example,

printf(" \" \" " "  /* this is not a comment and is surrounded by an unknown number of double-quotes */");

has a comment sequence inside a double-quoted string, and it isn't a simple task to determine whether a sequence sits inside a valid pair of double quotes. Meanwhile, this

// this is a single line comment /* <--- this does not start a comment block 
// this is a second comment line with an */ within

has comment sequences inside other comments.

Is there a more comprehensive way to remove comments from C++ source that takes string literals and comments into account? For example, can we instruct the preprocessor to remove comments without carrying out, say, the #include directives?

The C preprocessor can remove the comments.

Edited:

I have updated this so that the predefined macros are used to evaluate the #if statements.

> cat t.cpp
/*
 * Normal comment
 */
// this is a single line comment /* <--- this does not start a comment block 
// this is a second comment line with an */ within
#include <stdio.h>

#if __SIZEOF_LONG__ == 4
int bits = 32;
#else
int bits = 16;
#endif

int main()
{
    printf(" \" \" " " /* this is not a comment and is surrounded by an unknown number of double-quotes */");
    /*
     * comment with a single // line comment embedded.
     */
    int x;
    // A single line comment /* Normal embedded */ Comment
}

Because we want the #if statements to expand correctly, we need a list of defines.
That's relatively trivial: cpp -E -dM.

Then we pipe the #defines and the original file back through the preprocessor, but this time prevent the includes from being expanded.

> cpp -E -dM t.cpp > /tmp/def
> cat /tmp/def t.cpp | sed -e s/^#inc/-#inc/ | cpp - | sed s/^-#inc/#inc/
# 1 "t.cpp"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "t.cpp"






#include <stdio.h>


int bits = 32;




int main()
{
    printf(" \" \" " " /* this is not a comment and is surrounded by an unknown number of double-quotes */");    



    int x;

}

Our SD C++ Formatter has an option to pretty-print the source text and remove all comments. It uses our full C++ front end to parse the text, so it is not confused by whitespace, line breaks, string literals or preprocessor issues, nor will its formatting changes break the code.

If you are removing comments, you may be trying to obfuscate the source code. The Formatter also comes in an obfuscating version.

You can use a rule-based parser (e.g. boost::spirit) to write syntax rules for comments. You will need to decide whether to process nested comments, depending on your compiler. The semantic actions that remove the comments should be pretty straightforward.
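
As a hand-written stand-in for such a rule (this is not boost::spirit code, and the name block_comment_end is made up for illustration), the nesting decision can be sketched with a depth counter. Standard C++ does not nest /* */ comments, so nested = false matches a conforming compiler; nested = true models a compiler extension.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: given the index of a "/*" in text, return the index
// one past the end of that block comment, or std::string::npos if the
// comment is never closed. With nested == true, inner "/*" ... "*/" pairs
// must balance before the outer comment ends.
std::size_t block_comment_end(const std::string& text, std::size_t start,
                              bool nested) {
    assert(text.compare(start, 2, "/*") == 0);  // caller must point at "/*"
    std::size_t depth = 1;
    std::size_t i = start + 2;  // skip the opening "/*"
    while (i + 1 < text.size()) {
        if (text[i] == '*' && text[i + 1] == '/') {
            i += 2;
            if (--depth == 0) return i;
        } else if (nested && text[i] == '/' && text[i + 1] == '*') {
            ++depth;
            i += 2;
        } else {
            ++i;
        }
    }
    return std::string::npos;
}
```

For the text "/* a /* b */ c */ x", the non-nested rule ends the comment after the first */ and leaves " c */ x" as code, while the nested rule consumes everything up to the second */.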

Regexes are not meant to parse languages; it's a frustrating attempt at best.
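
To illustrate the problem, here is a minimal sketch of the naive regex approach using std::regex. The name naive_strip is made up, and the two patterns are only one plausible way to write such a search. It removes real comments, but it also corrupts string literals that merely contain comment markers.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Naive approach: delete anything that looks like a comment, with no
// awareness of string or character literals.
std::string naive_strip(const std::string& src) {
    static const std::regex block(R"(/\*[\s\S]*?\*/)");  // /* ... */
    static const std::regex line(R"(//[^\n]*)");         // // to end of line
    return std::regex_replace(std::regex_replace(src, block, ""), line, "");
}
```

Given printf(" /* text */ "); this returns printf("  "); and silently changes the string. Given a literal such as "http://example.com" it cuts the line off after http: and leaves an unterminated string.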

You actually need a full-blown parser for this. You might wish to consider Clang: rewriting is an explicit goal of the Clang library suite, and there are existing rewriters that you could take inspiration from.

Could someone vote up my own answer to my own question?

Thanks to Martin York's idea, I found that in Visual Studio the solution looks very simple (subject to further testing). Just rename ALL preprocessor directives to something else (something that is not valid C++ syntax is fine) and run cl.exe with /P:

cl target.cpp /P

This produces target.i, which contains the source minus the comments. Just rename the directives back and there you go. You will probably need to remove the #line directives generated by cl.exe.
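
The renaming and cleanup steps can be sketched as two small text filters. This is only an illustration under assumptions: the function names are hypothetical, the '-' prefix follows the vc8.cpp example in this answer, and every line that still starts with '#' after preprocessing is assumed to be a generated #line.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// True if the first non-blank character of the line is '#'.
static bool is_directive(const std::string& line) {
    std::size_t i = line.find_first_not_of(" \t");
    return i != std::string::npos && line[i] == '#';
}

// Before running cl /P: prefix each directive with '-' so that the
// preprocessor no longer recognizes it.
std::string mask_directives(const std::string& src) {
    std::istringstream in(src);
    std::string line, out;
    while (std::getline(in, line)) {
        out += (is_directive(line) ? "-" + line : line) + "\n";
    }
    return out;
}

// After running cl /P: drop the generated #line lines and restore the
// masked directives.
std::string unmask_directives(const std::string& preprocessed) {
    std::istringstream in(preprocessed);
    std::string line, out;
    while (std::getline(in, line)) {
        if (is_directive(line)) continue;  // assumed to be a generated #line
        if (!line.empty() && line[0] == '-' && is_directive(line.substr(1)))
            line.erase(0, 1);              // "-#include" -> "#include"
        out += line + "\n";
    }
    return out;
}
```

Both filters work line by line, so a '#' at the start of a line inside a multi-line string literal would be masked and restored as well.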

This works because, according to MSDN, the phases of translation are:

Character mapping: Characters in the source file are mapped to the internal source representation. Trigraph sequences are converted to single-character internal representation in this phase.

Line splicing: All lines ending in a backslash (\) and immediately followed by a newline character are joined with the next line in the source file, forming logical lines from the physical lines. Unless it is empty, a source file must end in a newline character that is not preceded by a backslash.

Tokenization: The source file is broken into preprocessing tokens and white-space characters. Comments in the source file are replaced with one space character each. Newline characters are retained.

Preprocessing: Preprocessing directives are executed and macros are expanded into the source file. The #include statement invokes translation starting with the preceding three translation steps on any included text.

Character-set mapping: All source character set members and escape sequences are converted to their equivalents in the execution character set. For Microsoft C and C++, both the source and the execution character sets are ASCII.

String concatenation: All adjacent string and wide-string literals are concatenated. For example, "String " "concatenation" becomes "String concatenation".

Translation: All tokens are analyzed syntactically and semantically; these tokens are converted into object code.

Linkage: All external references are resolved to create an executable program or a dynamic-link library.

Comments are removed during tokenization, before the preprocessing phase. So just make sure that nothing is available for processing during the preprocessing phase (by removing all the directives), and the output will be exactly what the first three phases produced.

As for the user-defined .h files, use the /FI option to include them manually. The resulting .i file will be a combination of the .cpp and the .h files, without comments. Each piece is preceded by a #line with the proper filename, so it is easy to split them up in an editor. If we don't want to split them up manually, we will probably need the macro/scripting facility of some editor to do it automatically.

So now we don't have to care about any of the preprocessor directives. Even better, the line continuation character (backslash) is handled.
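
The reason a // comment ending in a backslash swallows the next physical line is that line splicing happens before comments are recognized. A minimal sketch of that phase follows. The name splice_lines is hypothetical and only \n line endings are assumed. It models how comment boundaries are determined, not the exact text that cl.exe writes to the .i file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Line splicing sketch: delete every backslash that is immediately followed
// by a newline, joining two physical lines into one logical line.
std::string splice_lines(const std::string& src) {
    std::string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;  // skip the newline as well as the backslash
        } else {
            out += src[i];
        }
    }
    return out;
}
```

After splicing, a comment line ending in a backslash and the physical line below it form a single logical line, so the whole of it is removed as one // comment.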

For example:

// vc8.cpp : Defines the entry point for the console application.
//

-#include "stdafx.h"
-#include <windows.h>
-#define NOERR
-#ifdef NOERR
  /* comment here */
 whatever error line is ok
-#else
  some error line if NOERR not defined
      // comment here
-#endif
void pr() ;
int _tmain(int argc, _TCHAR* argv[])
{
    pr();
    return 0;
}

/*comment*/

void pr() {
    printf(" /* "); /* comment inside string " */
    // comment terminated by \
    continue a comment line
    printf(" "); /** " " string inside comment */
    printf/* this is valid comment within line continuation */\
("some weird lines \
with line continuation");
}

After cl.exe vc8.cpp /P, it becomes the following, which can then be fed to cl.exe again after restoring the directives (and removing the #line):

#line 1 "vc8.cpp"



-#include "stdafx.h"
-#include <windows.h>
-#define NOERR
-#ifdef NOERR

 whatever error line is ok
-#else
  some error line if NOERR not defined

-#endif
void pr() ;
int _tmain(int argc, _TCHAR* argv[])
{
    pr();
    return 0;
}



void pr() {
    printf(" /* "); 


    printf(" "); 
    printf\
("some weird lines \
with line continuation");
}
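
#include <fstream>
#include <iostream>

// Copies input.txt to output.txt with // and /* */ comments removed.
// String and character literals are copied verbatim, so comment markers
// inside them are preserved. Line continuations, raw string literals and
// trigraphs are not handled.
int main() {
    std::ifstream fin("input.txt");
    std::ofstream fout("output.txt");
    if (!fin || !fout) {
        std::cerr << "cannot open input.txt or output.txt\n";
        return 1;
    }

    enum class State { Code, Literal, LineComment, BlockComment };
    State state = State::Code;
    char quote = '\0';  // delimiter of the literal currently being copied
    char ch;

    // Testing the result of get() avoids processing a stale character at EOF.
    while (fin.get(ch)) {
        switch (state) {
        case State::Code:
            if (ch == '/' && fin.peek() == '/') {
                fin.get();  // consume the second '/'
                state = State::LineComment;
            } else if (ch == '/' && fin.peek() == '*') {
                fin.get();  // consume the '*'
                state = State::BlockComment;
            } else {
                if (ch == '"' || ch == '\'') {
                    quote = ch;
                    state = State::Literal;
                }
                fout << ch;  // a lone '/' (division) is kept as well
            }
            break;
        case State::Literal:
            fout << ch;
            if (ch == '\\') {
                // Copy the escaped character so \" or \' does not end the literal.
                if (fin.get(ch)) fout << ch;
            } else if (ch == quote) {
                state = State::Code;
            }
            break;
        case State::LineComment:
            if (ch == '\n') {
                fout << ch;  // keep the line break that ends the comment
                state = State::Code;
            }
            break;
        case State::BlockComment:
            if (ch == '*' && fin.peek() == '/') {
                fin.get();    // consume the closing '/'
                fout << ' ';  // a block comment is replaced by one space
                state = State::Code;
            }
            break;
        }
    }
    return 0;
}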
