Script to remove Python comments/docstrings
Is there an available Python script or tool that can remove comments and docstrings from Python source?
It should handle cases like:
"""
aas
"""
def f():
m = {
u'x':
u'y'
} # faake docstring ;)
if 1:
'string' >> m
if 2:
'string' , m
if 3:
'string' > m
So in the end I came up with a simple script that uses the tokenize module and removes comment tokens. It seems to work pretty well, except that I am not able to remove docstrings in all cases. See if you can improve it to remove docstrings as well.
import cStringIO
import tokenize

def remove_comments(src):
    """
    This reads tokens using tokenize.generate_tokens and recombines them
    using tokenize.untokenize, skipping comment/docstring tokens in between.
    """
    f = cStringIO.StringIO(src)
    class SkipException(Exception): pass
    processed_tokens = []
    last_token = None
    # go through all the tokens and try to skip comments and docstrings
    for tok in tokenize.generate_tokens(f.readline):
        t_type, t_string, t_srow_scol, t_erow_ecol, t_line = tok
        try:
            if t_type == tokenize.COMMENT:
                raise SkipException()
            elif t_type == tokenize.STRING:
                if last_token is None or last_token[0] in [tokenize.INDENT]:
                    # FIXME: this may remove valid strings too?
                    #raise SkipException()
                    pass
        except SkipException:
            pass
        else:
            processed_tokens.append(tok)
            last_token = tok
    return tokenize.untokenize(processed_tokens)
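For concreteness, a quick driver for the script above (my own sketch, not part of the question; Python 2 is assumed because of cStringIO, and 'test.py' is a hypothetical file containing the sample source):

# 'test.py' is a placeholder file holding the sample code from above.
# Comments such as "# faake docstring ;)" are stripped, but docstrings
# survive, which is exactly the remaining problem described above.
sample = open('test.py').read()
print(remove_comments(sample))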
Also, I would like to test it on a large collection of scripts with good unit test coverage. Can you suggest such an open source project?
I am the author of "my God, he wrote a Python interpreter using regular expressions..." (i.e. pyminifier) mentioned at the link below =).
I just wanted to chime in and say that I have improved the code quite a bit using the tokenizer module (which I discovered thanks to this question =)).
You will be happy to note that the code no longer relies so much on regular expressions and uses the tokenizer to great effect. Anyway, here is the remove_comments_and_docstrings() function from pyminifier
(note: it works properly with the edge cases that the previously-posted code breaks on):
import cStringIO, tokenize

def remove_comments_and_docstrings(source):
    """
    Returns 'source' minus comments and docstrings.
    """
    io_obj = cStringIO.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        # The following two conditionals preserve indentation.
        # This is necessary because we're not using tokenize.untokenize()
        # (because it spits out code with copious amounts of oddly-placed
        # whitespace).
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        # Remove comments:
        if token_type == tokenize.COMMENT:
            pass
        # This series of conditionals removes docstrings:
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                # This is likely a docstring; double-check we're not inside an operator:
                if prev_toktype != tokenize.NEWLINE:
                    # Note regarding NEWLINE vs NL: The tokenize module
                    # differentiates between newlines that end a statement
                    # and newlines inside of operators such as parens,
                    # brackets, and curly braces. Newlines inside of
                    # operators are NL and newlines that end a statement
                    # are NEWLINE.
                    # Catch whole-module docstrings:
                    if start_col > 0:
                        # Unlabelled indentation means we're inside an operator
                        out += token_string
                    # Note regarding the INDENT token: The tokenize module does
                    # not label indentation inside of an operator (parens,
                    # brackets, and curly braces) as actual indentation.
                    # For example:
                    # def foo():
                    #     "The spaces before this docstring are tokenize.INDENT"
                    #     test = [
                    #         "The spaces before this string do not get a token"
                    #     ]
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    return out
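A minimal way to exercise it (my own sketch, not part of the original answer; 'my_module.py' is a placeholder filename):

# Python 2, to match the cStringIO import above; the filename is a
# placeholder for any source file you want to strip.
with open('my_module.py') as f:
    print(remove_comments_and_docstrings(f.read()))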
This does the job:
""" Strip comments and docstrings from a file.
"""
import sys, token, tokenize
def do_file(fname):
""" Run on just one file.
"""
source = open(fname)
mod = open(fname + ",strip", "w")
prev_toktype = token.INDENT
first_line = None
last_lineno = -1
last_col = 0
tokgen = tokenize.generate_tokens(source.readline)
for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen:
if 0: # Change to if 1 to see the tokens fly by.
print("%10s %-14s %-20r %r" % (
tokenize.tok_name.get(toktype, toktype),
"%d.%d-%d.%d" % (slineno, scol, elineno, ecol),
ttext, ltext
))
if slineno > last_lineno:
last_col = 0
if scol > last_col:
mod.write(" " * (scol - last_col))
if toktype == token.STRING and prev_toktype == token.INDENT:
# Docstring
mod.write("#--")
elif toktype == tokenize.COMMENT:
# Comment
mod.write("##\n")
else:
mod.write(ttext)
prev_toktype = toktype
last_col = ecol
last_lineno = elineno
if __name__ == '__main__':
do_file(sys.argv[1])
I leave stub comments in the place of docstrings and comments, since it simplifies the code. If you remove them completely, you also have to get rid of the indentation before them.
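The point about indentation is easiest to see on a suite whose body is nothing but a docstring; a hypothetical before/after (my own illustration, not from the original answer):

# Input:
def noop():
    """Do nothing."""

# Output of the script above -- the "#--" stub keeps the block
# non-empty and the result syntactically valid:
def noop():
    #--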
Here is a modification of Dan's solution that makes it work for Python 3, also removes empty lines, and makes it ready to use:
import io, tokenize, re

def remove_comments_and_docstrings(source):
    io_obj = io.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        if token_type == tokenize.COMMENT:
            pass
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                if prev_toktype != tokenize.NEWLINE:
                    if start_col > 0:
                        out += token_string
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    out = '\n'.join(l for l in out.splitlines() if l.strip())
    return out

with open('test.py', 'r') as f:
    print(remove_comments_and_docstrings(f.read()))
This recipe here claims to do what you want. And a few other things as well.
I found an easier way to do this with the ast and astunparse modules (available from pip). It converts the code text into a syntax tree, and then the astunparse module prints the code back out without comments. I had to strip the docstrings with a simple matching, but it seems to work. I have been looking through the output, and so far the only downside of this method is that it strips all newlines from your code.
import ast, astunparse

with open('my_module.py') as f:
    lines = astunparse.unparse(ast.parse(f.read())).split('\n')
    for line in lines:
        if line.lstrip()[:1] not in ("'", '"'):
            print(line)
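On Python 3.9+ the same approach works without the third-party dependency, since ast.unparse was added to the standard library; a minimal sketch under that assumption ('my_module.py' is again a placeholder):

import ast

# ast.unparse (Python 3.9+) re-emits source from the parsed tree;
# comments are dropped by parsing, and docstrings are filtered with
# the same quote-prefix heuristic as above.
with open('my_module.py') as f:
    tree = ast.parse(f.read())
for line in ast.unparse(tree).split('\n'):
    if line.lstrip()[:1] not in ("'", '"'):
        print(line)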
Try testing each block of tokens that ends in NEWLINE. Then I believe the correct pattern for a docstring (including the case where it serves as a comment but is not assigned to __doc__) is (assuming the match is performed from the start of the file or right after a NEWLINE):

( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE

This should handle all the tricky cases: string concatenation, line continuation, module/class/function docstrings, and a comment on the same line after a string. Note that NL and NEWLINE are distinct tokens, so we don't need to worry about a line consisting of a single string inside an expression.
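A rough sketch of how that pattern could be checked against the token stream (my own illustration, not code from the original answer; it buffers one logical line at a time and ignores the complication of standalone comment lines before a string):

import io
import tokenize

def is_docstring_block(toks):
    # Match ( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE
    i = 0
    while i < len(toks) and toks[i].type == tokenize.DEDENT:
        i += 1
    if i == 0 and toks and toks[0].type == tokenize.INDENT:
        i = 1
    start = i
    while i < len(toks) and toks[i].type == tokenize.STRING:
        i += 1
    if i == start:                      # at least one STRING is required
        return False
    if i < len(toks) and toks[i].type == tokenize.COMMENT:
        i += 1
    return i == len(toks) - 1 and toks[i].type == tokenize.NEWLINE

def docstring_spans(source):
    # Yield the (start, end) positions of blocks matching the pattern.
    block = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NL:
            continue                    # non-logical newline: not a block boundary
        block.append(tok)
        if tok.type == tokenize.NEWLINE:
            if is_docstring_block(block):
                yield block[0].start, block[-1].end
            block = []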
I was just using the code given by Dan McDougall, and I found two problems: the output contained runs of consecutive blank lines, and the necessary spaces around keywords (import, return, and so on) were lost when tokens were concatenated.
I think I solved both by adding (before the return) these lines:
# Removing unneeded newlines from string
buffered_content = cStringIO.StringIO(content) # Takes the string generated by Dan McDougall's code as input
content_without_newlines = ""
previous_token_type = tokenize.NEWLINE
for tokens in tokenize.generate_tokens(buffered_content.readline):
    token_type = tokens[0]
    token_string = tokens[1]
    if previous_token_type == tokenize.NL and token_type == tokenize.NL:
        pass
    else:
        # add necessary spaces
        prev_space = ''
        next_space = ''
        if token_string in ['and', 'as', 'or', 'in', 'is']:
            prev_space = ' '
        if token_string in ['and', 'del', 'from', 'not', 'while', 'as', 'elif', 'global', 'or', 'with', 'assert', 'if', 'yield', 'except', 'import', 'print', 'class', 'exec', 'in', 'raise', 'is', 'return', 'def', 'for', 'lambda']:
            next_space = ' '
        content_without_newlines += prev_space + token_string + next_space # This will be our new output!
    previous_token_type = token_type
I was trying to create a program that would count all the lines in a Python file, ignoring blank lines, comment lines, and docstrings. Here is my solution:
from collections import Counter

with open(file_path, 'r', encoding='utf-8') as pyt_file:
    count = 0
    docstring = False
    for i_line in pyt_file.readlines():
        cur_line = i_line.rstrip().replace(' ', '')
        if cur_line.startswith('"""') and not docstring:
            marks_counter = Counter(cur_line)
            if marks_counter['"'] == 6:
                count -= 1
            else:
                docstring = True
        elif cur_line.startswith('"""') and docstring:
            count -= 1
            docstring = False
        if len(cur_line) > 0 and not cur_line.startswith('#') and not docstring:
            count += 1
My problem was detecting the docstrings (both single-line and multi-line), so I suppose if you want to remove them, you could try the same flag-based approach.
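For removal instead of counting, the same flag can decide which lines to keep; a rough sketch (my own adaptation, with the same limitations as the counter above, e.g. it only recognizes docstrings delimited by triple double quotes at the start of a line):

def strip_docstring_lines(file_path):
    # Reuse the flag idea above to drop docstring lines rather than
    # count them; only '"""'-delimited docstrings are handled.
    kept = []
    docstring = False
    with open(file_path, 'r', encoding='utf-8') as pyt_file:
        for i_line in pyt_file:
            cur_line = i_line.rstrip().replace(' ', '')
            if cur_line.startswith('"""') and not docstring:
                if cur_line.count('"""') < 2:   # opening line of a multi-line docstring
                    docstring = True
                continue                        # drop the docstring line
            if docstring:
                if cur_line.endswith('"""'):
                    docstring = False
                continue                        # drop lines inside the docstring
            kept.append(i_line)
    return kept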
P.S. I know this is an old question, but while I was dealing with my problem I couldn't find anything simple and effective.