
Script to remove Python comments/docstrings


Is there a Python script or tool available that can remove comments and docstrings from Python source?

It should handle cases like this:

"""
aas
"""
def f():
    m = {
        u'x':
            u'y'
        } # faake docstring ;)
    if 1:
        'string' >> m
    if 2:
        'string' , m
    if 3:
        'string' > m

So in the end I came up with a simple script that uses the tokenize module and removes the comment tokens. It seems to work pretty well, except that I am not able to remove docstrings in all cases. See if you can improve it to remove docstrings as well.

import cStringIO
import tokenize

def remove_comments(src):
    """
    This reads tokens using tokenize.generate_tokens and recombines them
    using tokenize.untokenize, skipping comment/docstring tokens in between
    """
    f = cStringIO.StringIO(src)
    class SkipException(Exception): pass
    processed_tokens = []
    last_token = None
    # go thru all the tokens and try to skip comments and docstrings
    for tok in tokenize.generate_tokens(f.readline):
        t_type, t_string, t_srow_scol, t_erow_ecol, t_line = tok

        try:
            if t_type == tokenize.COMMENT:
                raise SkipException()

            elif t_type == tokenize.STRING:

                if last_token is None or last_token[0] in [tokenize.INDENT]:
                    # FIXME: this may remove valid strings too?
                    #raise SkipException()
                    pass

        except SkipException:
            pass
        else:
            processed_tokens.append(tok)

        last_token = tok

    return tokenize.untokenize(processed_tokens)
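
For reference, a minimal way to exercise the function above (a sketch; it assumes Python 2, since the code uses cStringIO):

if __name__ == '__main__':
    import sys
    with open(sys.argv[1]) as f:
        print(remove_comments(f.read()))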

Also, I would like to test it on a large collection of scripts with good unit test coverage. Can you recommend such an open source project?

I'm the author of "My god, he wrote a Python interpreter using regular expressions..." (i.e. pyminifier), mentioned at the link below =).
I just wanted to chime in and say that I've improved the code quite a bit using the tokenizer module (which I discovered thanks to this question =)).

You'll be happy to note that the code no longer relies so much on regular expressions, and uses the tokenizer to great effect. Anyway, here's the remove_comments_and_docstrings() function from pyminifier
(Note: it works properly with the edge cases that the previously posted code breaks on):

import cStringIO, tokenize
def remove_comments_and_docstrings(source):
    """
    Returns 'source' minus comments and docstrings.
    """
    io_obj = cStringIO.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        # The following two conditionals preserve indentation.
        # This is necessary because we're not using tokenize.untokenize()
        # (because it spits out code with copious amounts of oddly-placed
        # whitespace).
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        # Remove comments:
        if token_type == tokenize.COMMENT:
            pass
        # This series of conditionals removes docstrings:
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                # This is likely a docstring; double-check we're not inside an operator:
                if prev_toktype != tokenize.NEWLINE:
                    # Note regarding NEWLINE vs NL: The tokenize module
                    # differentiates between newlines that end a statement
                    # and newlines inside of operators such as parens, brackets,
                    # and curly braces.  Newlines that end a statement are
                    # NEWLINE, and newlines inside of operators (and on blank
                    # lines) are NL.
                    # Catch whole-module docstrings:
                    if start_col > 0:
                        # Unlabelled indentation means we're inside an operator
                        out += token_string
                    # Note regarding the INDENT token: The tokenize module does
                    # not label indentation inside of an operator (parens,
                    # brackets, and curly braces) as actual indentation.
                    # For example:
                    # def foo():
                    #     "The spaces before this docstring are tokenize.INDENT"
                    #     test = [
                    #         "The spaces before this string do not get a token"
                    #     ]
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    return out
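
To see the NEWLINE/NL and INDENT behaviour those comments describe, here is a quick check of my own (it assumes Python 3, where io.StringIO feeds tokenize.generate_tokens directly):

import io
import tokenize

sample = (
    "def foo():\n"
    "    'docstring'\n"
    "    test = [\n"
    "        'not a docstring'\n"
    "    ]\n"
)

# Print each token's type name and text.
for tok in tokenize.generate_tokens(io.StringIO(sample).readline):
    print(tokenize.tok_name[tok[0]], repr(tok[1]))

# The docstring is preceded by an INDENT token and followed by NEWLINE,
# while the string inside the list sits between NL tokens and the spaces
# before it produce no INDENT token at all.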

This does the job:

""" Strip comments and docstrings from a file.
"""

import sys, token, tokenize

def do_file(fname):
    """ Run on just one file.

    """
    source = open(fname)
    mod = open(fname + ",strip", "w")

    prev_toktype = token.INDENT
    first_line = None
    last_lineno = -1
    last_col = 0

    tokgen = tokenize.generate_tokens(source.readline)
    for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen:
        if 0:   # Change to if 1 to see the tokens fly by.
            print("%10s %-14s %-20r %r" % (
                tokenize.tok_name.get(toktype, toktype),
                "%d.%d-%d.%d" % (slineno, scol, elineno, ecol),
                ttext, ltext
                ))
        if slineno > last_lineno:
            last_col = 0
        if scol > last_col:
            mod.write(" " * (scol - last_col))
        if toktype == token.STRING and prev_toktype == token.INDENT:
            # Docstring
            mod.write("#--")
        elif toktype == tokenize.COMMENT:
            # Comment
            mod.write("##\n")
        else:
            mod.write(ttext)
        prev_toktype = toktype
        last_col = ecol
        last_lineno = elineno

if __name__ == '__main__':
    do_file(sys.argv[1])

I'm leaving stub comments in place of docstrings and comments, since that simplifies the code. If you remove them completely, you also have to get rid of the indentation before them.

Here is a modification of Dan's solution to make it work for Python 3, also remove empty lines, and make it ready to use:

import io, tokenize, re
def remove_comments_and_docstrings(source):
    io_obj = io.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        if token_type == tokenize.COMMENT:
            pass
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                if prev_toktype != tokenize.NEWLINE:
                    if start_col > 0:
                        out += token_string
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    out = '\n'.join(l for l in out.splitlines() if l.strip())
    return out
with open('test.py', 'r') as f:
    print(remove_comments_and_docstrings(f.read()))

This recipe claims to do what you want, plus a few other things.

I found a simpler way to do this using the ast and astunparse modules (available from pip). It converts the code text into a syntax tree, and then the astunparse module prints the code back out without the comments. I had to strip the docstrings with a simple match, but it seems to work. I've been looking through the output, and so far the only downside of this approach is that it removes all newlines from the code.

import ast, astunparse

with open('my_module.py') as f:
    lines = astunparse.unparse(ast.parse(f.read())).split('\n')
    for line in lines:
        if line.lstrip()[:1] not in ("'", '"'):
            print(line)
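
If the quote-prefix check above feels too fragile, one alternative (a sketch of my own, not part of the answer) is to drop docstring nodes at the AST level before unparsing, so ordinary string expressions are left alone; 'my_module.py' is the same hypothetical filename as above:

import ast, astunparse

def strip_docstrings(tree):
    """Remove docstring statements from the module, classes and functions."""
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node, clean=False) is not None:
                # Drop the docstring; keep a `pass` if it was the only statement.
                node.body = node.body[1:] or [ast.Pass()]
    return tree

with open('my_module.py') as f:
    print(astunparse.unparse(strip_docstrings(ast.parse(f.read()))))

# On Python 3.9+ the stdlib ast.unparse() can be used instead of astunparse.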

Try testing each chunk of tokens ending with NEWLINE. Then the correct pattern for a docstring (including the case where it serves as a comment but isn't assigned to __doc__) is, I believe (assuming the match is performed from the start of the file or right after a NEWLINE):

( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE

This should handle all the tricky cases: string concatenation, line continuation, module/class/function docstrings, comments on the same line after a string. Note that there is a difference between the NL and NEWLINE tokens, so we don't need to worry about a single string on its own line inside an expression.
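
Here is a rough sketch of that chunk-based matching (my own illustration, not the answer's code). It groups tokens into logical lines ending at NEWLINE, and drops the physical lines of any chunk made up only of strings plus optional indentation and a trailing comment:

import io
import tokenize

# Token types that may appear in a docstring-only logical line.
ALLOWED = {tokenize.INDENT, tokenize.DEDENT, tokenize.STRING,
           tokenize.COMMENT, tokenize.NEWLINE}

def docstring_line_ranges(source):
    """Yield (first_line, last_line) for every logical line matching
    ( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE."""
    chunk = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok[0] == tokenize.NL:        # newline inside a statement or a blank line
            continue
        chunk.append(tok)
        if tok[0] == tokenize.NEWLINE:   # end of a logical line
            strings = [t for t in chunk if t[0] == tokenize.STRING]
            if strings and all(t[0] in ALLOWED for t in chunk):
                yield strings[0][2][0], chunk[-1][3][0]
            chunk = []

def strip_docstring_statements(source):
    drop = set()
    for first, last in docstring_line_ranges(source):
        drop.update(range(first, last + 1))
    return "".join(line for i, line in
                   enumerate(source.splitlines(True), 1) if i not in drop)

Like the other answers, this leaves an invalid empty body behind if a docstring was a function's only statement, so a real tool would need to insert a pass (or a stub comment) in that case.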

I just used the code given by Dan McDougall, and I found two problems.

  1. There were too many empty new lines, so I decided to remove a line every time there were two consecutive new lines
  2. When the Python code was processed, all the spaces were missing (except the indentation), so things like "import Anything" turned into "importAnything", which caused problems. I added spaces before and after the reserved Python words where needed. I hope I didn't make any mistakes there.

I think I have fixed both of these things by adding (before return) a few more lines:

# Removing unneeded newlines from string
buffered_content = cStringIO.StringIO(content) # Takes the string generated by Dan McDougall's code as input
content_without_newlines = ""
previous_token_type = tokenize.NEWLINE
for tokens in tokenize.generate_tokens(buffered_content.readline):
    token_type = tokens[0]
    token_string = tokens[1]
    if previous_token_type == tokenize.NL and token_type == tokenize.NL:
        pass
    else:
        # add necessary spaces
        prev_space = ''
        next_space = ''
        if token_string in ['and', 'as', 'or', 'in', 'is']:
            prev_space = ' '
        if token_string in ['and', 'del', 'from', 'not', 'while', 'as', 'elif', 'global', 'or', 'with', 'assert', 'if', 'yield', 'except', 'import', 'print', 'class', 'exec', 'in', 'raise', 'is', 'return', 'def', 'for', 'lambda']:
            next_space = ' '
        content_without_newlines += prev_space + token_string + next_space # This will be our new output!
    previous_token_type = token_type

I was trying to create a program that would count all the lines in a Python file, ignoring blank lines, lines with comments and docstrings. Here is my solution:

from collections import Counter

with open(file_path, 'r', encoding='utf-8') as pyt_file:
  count = 0
  docstring = False

  for i_line in pyt_file.readlines():

    cur_line = i_line.rstrip().replace(' ', '')

    if cur_line.startswith('"""') and not docstring:
      marks_counter = Counter(cur_line)
      if marks_counter['"'] == 6:
        count -= 1
      else:
        docstring = True

    elif cur_line.startswith('"""') and docstring:
      count -= 1
      docstring = False

    if len(cur_line) > 0 and not cur_line.startswith('#') and not docstring:
      count += 1

My problem was detecting the docstrings (both one-line and multi-line), so I suppose if you want to strip those out, you could try the same flag-based solution.
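
For what it is worth, here is a rough sketch (my own, not the answer's) of the same flag idea turned into a stripper instead of a counter; it shares the answer's limitations, i.e. it only recognises docstrings delimited by triple double quotes at the start of a line:

from collections import Counter

def strip_lines(file_path):
    """Return the file's text minus blank lines, '#' comment lines and
    triple-double-quoted docstring lines."""
    kept = []
    docstring = False
    with open(file_path, 'r', encoding='utf-8') as pyt_file:
        for i_line in pyt_file:
            cur_line = i_line.rstrip().replace(' ', '')
            if cur_line.startswith('"""') and not docstring:
                # Six quote marks mean a one-line docstring; anything else
                # opens a multi-line one.
                if Counter(cur_line)['"'] != 6:
                    docstring = True
                continue
            if docstring:
                if cur_line.startswith('"""'):
                    docstring = False
                continue
            if cur_line and not cur_line.startswith('#'):
                kept.append(i_line)
    return ''.join(kept)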

P.S. I know this is an old question, but while I was dealing with my problem, I couldn't find anything simple that worked.
