
Script to remove Python comments/docstrings

Is there a Python script or tool available which can remove comments and docstrings from Python source?

It should take care of cases like:

"""
aas
"""
def f():
    m = {
        u'x':
            u'y'
        } # faake docstring ;)
    if 1:
        'string' >> m
    if 2:
        'string' , m
    if 3:
        'string' > m

So finally I have come up with a simple script, which uses the tokenize module and removes comment tokens. It seems to work pretty well, except that I am not able to remove docstrings in all cases. See if you can improve it to remove docstrings.

import cStringIO
import tokenize

def remove_comments(src):
    """
    This reads tokens using tokenize.generate_tokens and recombines them
    using tokenize.untokenize, skipping comment/docstring tokens in between
    """
    f = cStringIO.StringIO(src)
    class SkipException(Exception): pass
    processed_tokens = []
    last_token = None
    # go thru all the tokens and try to skip comments and docstrings
    for tok in tokenize.generate_tokens(f.readline):
        t_type, t_string, t_srow_scol, t_erow_ecol, t_line = tok

        try:
            if t_type == tokenize.COMMENT:
                raise SkipException()

            elif t_type == tokenize.STRING:

                if last_token is None or last_token[0] in [tokenize.INDENT]:
                    # FIXME: this may remove valid strings too?
                    #raise SkipException()
                    pass

        except SkipException:
            pass
        else:
            processed_tokens.append(tok)

        last_token = tok

    return tokenize.untokenize(processed_tokens)
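
For reference, a minimal way to exercise the function above on a file (my own usage sketch, not part of the original question; written for Python 2 to match the cStringIO import):

if __name__ == '__main__':
    import sys
    # Read a source file given on the command line, strip its comments,
    # and write the result to stdout.
    with open(sys.argv[1]) as f:
        sys.stdout.write(remove_comments(f.read()))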

Also I would like to test it on a very large collection of scripts with good unit test coverage. Can you suggest such an open source project?

I'm the author of the "mygod, he has written a python interpreter using regex..." (i.e. pyminifier) mentioned at that link below =).
I just wanted to chime in and say that I've improved the code quite a bit using the tokenizer module (which I discovered thanks to this question =) ).

You'll be happy to note that the code no longer relies so much on regular expressions and uses the tokenizer to great effect. Anyway, here's the remove_comments_and_docstrings() function from pyminifier
(Note: it works properly with the edge cases that the previously-posted code breaks on):

import cStringIO, tokenize
def remove_comments_and_docstrings(source):
    """
    Returns 'source' minus comments and docstrings.
    """
    io_obj = cStringIO.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        # The following two conditionals preserve indentation.
        # This is necessary because we're not using tokenize.untokenize()
        # (because it spits out code with copious amounts of oddly-placed
        # whitespace).
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        # Remove comments:
        if token_type == tokenize.COMMENT:
            pass
        # This series of conditionals removes docstrings:
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                # This is likely a docstring; double-check we're not inside an operator:
                if prev_toktype != tokenize.NEWLINE:
                    # Note regarding NEWLINE vs NL: the tokenize module
                    # differentiates between newlines that end a statement
                    # (NEWLINE) and newlines inside of operators such as
                    # parens, brackets, and curly braces, or on blank lines,
                    # which are emitted as NL.
                    # Catch whole-module docstrings:
                    if start_col > 0:
                        # Unlabelled indentation means we're inside an operator
                        out += token_string
                    # Note regarding the INDENT token: The tokenize module does
                    # not label indentation inside of an operator (parens,
                    # brackets, and curly braces) as actual indentation.
                    # For example:
                    # def foo():
                    #     "The spaces before this docstring are tokenize.INDENT"
                    #     test = [
                    #         "The spaces before this string do not get a token"
                    #     ]
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    return out
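
To see the NEWLINE/NL distinction that the comments above rely on, here is a small inspection sketch of my own (Python 3, using io instead of cStringIO; not part of the original answer):

import io, tokenize

SAMPLE = "x = [\n    1,\n]\n"

# The newline inside the brackets is reported as NL, while the newline
# that ends the statement is reported as NEWLINE.
for tok in tokenize.generate_tokens(io.StringIO(SAMPLE).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))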

This does the job:

""" Strip comments and docstrings from a file.
"""

import sys, token, tokenize

def do_file(fname):
    """ Run on just one file.

    """
    source = open(fname)
    mod = open(fname + ",strip", "w")

    prev_toktype = token.INDENT
    first_line = None
    last_lineno = -1
    last_col = 0

    tokgen = tokenize.generate_tokens(source.readline)
    for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen:
        if 0:   # Change to if 1 to see the tokens fly by.
            print("%10s %-14s %-20r %r" % (
                tokenize.tok_name.get(toktype, toktype),
                "%d.%d-%d.%d" % (slineno, scol, elineno, ecol),
                ttext, ltext
                ))
        if slineno > last_lineno:
            last_col = 0
        if scol > last_col:
            mod.write(" " * (scol - last_col))
        if toktype == token.STRING and prev_toktype == token.INDENT:
            # Docstring
            mod.write("#--")
        elif toktype == tokenize.COMMENT:
            # Comment
            mod.write("##\n")
        else:
            mod.write(ttext)
        prev_toktype = toktype
        last_col = ecol
        last_lineno = elineno

if __name__ == '__main__':
    do_file(sys.argv[1])

I'm leaving stub comments in place of the docstrings and comments since it simplifies the code. If you remove them completely, you also have to get rid of the indentation before them.

Here is a modification of Dan's solution to make it run under Python 3, also remove empty lines, and make it ready to use:

import io, tokenize, re
def remove_comments_and_docstrings(source):
    io_obj = io.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        if token_type == tokenize.COMMENT:
            pass
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                if prev_toktype != tokenize.NEWLINE:
                    if start_col > 0:
                        out += token_string
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    out = '\n'.join(l for l in out.splitlines() if l.strip())
    return out
with open('test.py', 'r') as f:
    print(remove_comments_and_docstrings(f.read()))

This recipe here claims to do what you want. And a few other things too.

I found an easier way to do this with the ast and astunparse modules (available from pip). It converts the code text into a syntax tree, and then the astunparse module prints the code back out again without the comments. I had to strip out the docstrings with simple matching, but it seems to work. I've been looking through the output, and so far the only downside of this method is that it strips all newlines from your code.

import ast, astunparse

with open('my_module.py') as f:
    lines = astunparse.unparse(ast.parse(f.read())).split('\n')
    for line in lines:
        if line.lstrip()[:1] not in ("'", '"'):
            print(line)
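
As a possible alternative on Python 3.9+, the standard-library ast.unparse can play the role of astunparse, and docstrings can be dropped from the tree itself rather than filtered out of the text afterwards. A rough sketch of my own (the function name is mine, and edge cases are untested):

import ast

def strip_docstrings(source):
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            body = node.body
            # A docstring is a bare string expression appearing as the
            # first statement of a module, class, or function body.
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                body.pop(0)
                if not body:
                    # Keep the block syntactically valid if the
                    # docstring was its only statement.
                    body.append(ast.Pass())
    return ast.unparse(tree)

Comments disappear automatically because the parser never puts them in the tree; like the astunparse approach, this loses blank lines and the original formatting.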

Try testing each chunk of tokens ending with NEWLINE. Then the correct pattern for a docstring (including cases where it serves as a comment but isn't assigned to __doc__) I believe is (assuming the match is performed from the start of the file or after a NEWLINE):

( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE

This should handle all tricky cases: string concatenation, line continuation, module/class/function docstrings, and a comment on the same line after the string. Note that there is a difference between NL and NEWLINE tokens, so we don't need to worry about a string that sits alone on a line inside an expression.
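
A minimal sketch of that check, of my own making (the helper name and the regex-over-token-names encoding are assumptions, not from the answer): collect the token types of one chunk, turn them into a space-separated string of names, and match it against the pattern.

import re
import tokenize

# The chunk is the list of token types collected since the previous
# NEWLINE (or the start of the file), including the terminating NEWLINE.
DOCSTRING_CHUNK = re.compile(
    r'^((DEDENT )+|(INDENT )?)(STRING )+(COMMENT )?NEWLINE $')

def looks_like_docstring(chunk_types):
    names = ''.join(tokenize.tok_name[t] + ' ' for t in chunk_types)
    return bool(DOCSTRING_CHUNK.match(names))

NL tokens (for example from blank lines inside the chunk) would need to be filtered out before the match, since the pattern above does not mention them.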

I've just used the code given by Dan McDougall, and I've found two problems.

  1. There were too many empty lines, so I decided to remove a line every time there were two consecutive blank lines.
  2. When the Python code was processed, all spaces were missing (except indentation), so things such as "import Anything" changed into "importAnything", which caused problems. I added spaces before and after the reserved Python words that needed it. I hope I didn't make any mistakes there.

I think I have fixed both things by adding (before return) a few more lines:

# Removing unneeded newlines from string
buffered_content = cStringIO.StringIO(content) # Takes the string generated by Dan McDougall's code as input
content_without_newlines = ""
previous_token_type = tokenize.NEWLINE
for tokens in tokenize.generate_tokens(buffered_content.readline):
    token_type = tokens[0]
    token_string = tokens[1]
    if previous_token_type == tokenize.NL and token_type == tokenize.NL:
        pass
    else:
        # add necessary spaces
        prev_space = ''
        next_space = ''
        if token_string in ['and', 'as', 'or', 'in', 'is']:
            prev_space = ' '
        if token_string in ['and', 'del', 'from', 'not', 'while', 'as', 'elif', 'global', 'or', 'with', 'assert', 'if', 'yield', 'except', 'import', 'print', 'class', 'exec', 'in', 'raise', 'is', 'return', 'def', 'for', 'lambda']:
            next_space = ' '
        content_without_newlines += prev_space + token_string + next_space # This will be our new output!
    previous_token_type = token_type

I was trying to create a program that would count all lines in a Python file, ignoring blank lines, comment lines, and docstrings. Here is my solution:

from collections import Counter

with open(file_path, 'r', encoding='utf-8') as pyt_file:
  count = 0
  docstring = False

  for i_line in pyt_file.readlines():

    cur_line = i_line.rstrip().replace(' ', '')

    if cur_line.startswith('"""') and not docstring:
      marks_counter = Counter(cur_line)
      if marks_counter['"'] == 6:
        count -= 1
      else:
        docstring = True

    elif cur_line.startswith('"""') and docstring:
      count -= 1
      docstring = False

    if len(cur_line) > 0 and not cur_line.startswith('#') and not docstring:
      count += 1

My problem was to detect the docstrings (both one-line and multi-line), so I suppose if you want to delete those you can try to use the same flag-based solution.

PS: I understand that this is an old question, but when I was dealing with my problem I couldn't find anything simple and effective.
