Convert quotation marks to Latex format with Python
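tl;dr version
My paragraphs may contain quotation marks (e.g. "blah blah", "this also", etc.). I have to replace these with LaTeX-style quotes (e.g. ``blah blah'', ``this also'', etc.) with the help of Python 3.0.
Background
I have a lot of plain text files (over 100). I need to do a little text processing on these files and then build a single LaTeX document from their contents. I am using Python 3.0 for this. I can get everything else (escape characters, sections, and so on) to work, but I cannot get the quotation marks right.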
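I can find the quotes with a regex pattern (as described here), but how do I replace them with the given pattern? I don't know how to use the re.sub() function in this case, because my string may contain multiple instances of quotes. There is a related question, but how can I implement this with Python?

Cases considered:
Regular "double-quotes" and 'single-quotes'. There may be other quotation marks (see this question).
Contractions and the possessive 's contain single quotes (e.g. don't, John's). They are characterised by alphabetic characters on both sides of the quote.
Plural possessives have a single quote after the word (e.g. the actresses' roles).

import re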
def texify_single_quote(in_string):
    in_string = ' ' + in_string  # Hack (see explanations)
    return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]

def texify_double_quote(in_string):
    return re.sub(r'"(.*?)"', r"``\1''", in_string)

with open("test.txt", 'r') as fd_in, open("output.txt", 'w') as fd_out:
    for line in fd_in:
        # Test for commutativity: the order of the two conversions must not matter
        assert texify_single_quote(texify_double_quote(line)) == texify_double_quote(texify_single_quote(line))
        line = texify_single_quote(line)
        line = texify_double_quote(line)
        fd_out.write(line)
Input file (test.txt):
# 'single', 'single', "double"
# 'single', "double", 'single'
# "double", 'single', 'single'
# "double", "double", 'single'
# "double", 'single', "double"
# I'm a 'single' person
# I'm a "double" person?
# Ownership for plural words; the peoples' 'rights'
# John's dog barked 'Woof!', and Fred's parents' 'loving' cat ran away.
# "A double-quoted phrase, with a 'single' quote inside"
# 'A single-quoted phrase with a "double quote" inside, with contracted words such as "don't"'
# 'A single-quoted phrase with a regular noun such as actresses' roles'
Output (output.txt):
# `single', `single', ``double''
# `single', ``double'', `single'
# ``double'', `single', `single'
# ``double'', ``double'', `single'
# ``double'', `single', ``double''
# I'm a `single' person
# I'm a ``double'' person?
# Ownership for plural words; the peoples' `rights'
# John's dog barked `Woof!', and Fred's parents' `loving' cat ran away.
# ``A double-quoted phrase, with a `single' quote inside''
# `A single-quoted phrase with a ``double quote'' inside, with contracted words such as ``don't'''
# `A single-quoted phrase with a regular noun such as actresses' roles'
(Note that the leading comment markers were added only to stop the post's formatting from altering the output!)
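Let's break down the regex pattern (?<=\s)'(?!')(.*?)':

(?<=\s)'(?!') handles the opening single quote, (.*?) handles what is inside the quotes, and the final ' matches the closing quote.

(?<=\s)' is a positive look-behind: it only matches a single quote that has whitespace (\s) in front of it. This is important to prevent matching contractions and possessives such as can't, John's or actresses'.

'(?!') is a negative look-ahead: it only matches a single quote that is not followed by another single quote, so a match never starts at the first quote of a doubled ''.

(.*?) lazily captures what is between the quotes, and \1 in the replacement contains the capture.

The line in_string = ' ' + in_string is there because the positive look-behind cannot match a single quote at the very start of a line. Prepending a space to every line (and removing it again with the slice in return re.sub(...)[1:]) solves this problem!

Regular expressions are very useful for some tasks, but they are still limited (read this for more information). Writing a parser for this task seems more error-prone, though.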
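The leading-space hack described above can probably be avoided with a negative look-behind. The sketch below is not part of the original answer: the name texify_single_quote_no_pad is hypothetical, and it assumes that (?<!\S) ("not preceded by a non-whitespace character") is an acceptable replacement for (?<=\s), since it succeeds both after whitespace and at the start of the string.

```python
import re

# Assumption: (?<!\S) stands in for (?<=\s). It succeeds after whitespace
# and also at the very start of the string, so no padding is needed.
SINGLE_QUOTE_RE = re.compile(r"(?<!\S)'(?!')(.*?)'")


def texify_single_quote_no_pad(in_string):
    """Convert 'text' to `text' without prepending a space first."""
    return SINGLE_QUOTE_RE.sub(r"`\1'", in_string)


if __name__ == "__main__":
    samples = (
        "'single' at line start",
        "I'm a 'single' person",
        "the peoples' 'rights'",
    )
    for sample in samples:
        print(texify_single_quote_no_pad(sample))
```

On single lines this should behave the same as the padded version, because any quote that is not at position 0 always has a preceding character that is either whitespace or not.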
I created a simple function for this task and added comments. If you still have questions about the implementation, please ask.
the_text = '''
This is my \"test\" String
This is my \'test\' String
This is my 'test' String
This is my \"test\" String which has \"two\" quotes
This is my \'test\' String which has \'two\' quotes
This is my \'test\' String which has \"two\" quotes
This is my \"test\" String which has \'two\' quotes
'''
def convert_quotes(txt, quote_type):
    # find the positions of all quotes of the given type
    quotes_pos = []
    idx = -1
    while True:
        idx = txt.find(quote_type, idx+1)
        if idx == -1:
            break
        quotes_pos.append(idx)

    if len(quotes_pos) % 2 == 1:
        raise ValueError('bad number of quotes of type %s' % quote_type)

    # replace the opening quote of each pair with ``
    new_txt = []
    last_pos = -1
    for i, pos in enumerate(quotes_pos):
        # ignore the odd quotes - we dont replace them
        if i % 2 == 1:
            continue
        new_txt += txt[last_pos+1:pos]
        new_txt += '``'
        last_pos = pos

    # append the last part of the string
    new_txt += txt[last_pos+1:]
    return ''.join(new_txt)

print(convert_quotes(convert_quotes(the_text, '\''), '"'))
This prints:
This is my ``test" String
This is my ``test' String
This is my ``test' String
This is my ``test" String which has ``two" quotes
This is my ``test' String which has ``two' quotes
This is my ``test' String which has ``two" quotes
This is my ``test" String which has ``two' quotes
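As the output shows, only the opening quote of each pair is rewritten and the closing quote is left as typed. If the closing quote should be converted too (for example " to ''), one possible variant is sketched below. The function name convert_quote_pairs and its opening and closing parameters are hypothetical, and the sketch keeps the same assumption as above: quotes of a given type are balanced and never nested.

```python
def convert_quote_pairs(txt, quote_char='"', opening="``", closing="''"):
    """Replace alternating occurrences of quote_char with opening/closing marks."""
    # Splitting on the quote character yields one more piece than there are quotes,
    # so an even number of pieces means an odd (unbalanced) number of quotes.
    parts = txt.split(quote_char)
    if len(parts) % 2 == 0:
        raise ValueError('bad number of quotes of type %s' % quote_char)

    out = [parts[0]]
    for i, part in enumerate(parts[1:], start=1):
        # The 1st, 3rd, 5th... quote opens a pair; the 2nd, 4th... closes it.
        out.append(opening if i % 2 == 1 else closing)
        out.append(part)
    return ''.join(out)


if __name__ == "__main__":
    print(convert_quote_pairs('This is my "test" String which has "two" quotes'))
    print(convert_quote_pairs("This is my 'test' String", "'", "`", "'"))
```

Like convert_quotes, this treats every apostrophe as a quote, so text containing contractions such as don't would need the regex approach or some preprocessing first.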
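Note: parsing nested quotes is ambiguous.
For example, the string "bob said: "alice said: hello""
is nested in a proper language,
but
the string "bob said: hi" and "alice said: hello"
is not nested.
If this is your case, you may want to first convert the nested quotes into a different kind of quote, or use parentheses ()
to disambiguate the nested quotes.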
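One case where the ambiguity goes away is when the source text already uses directional (curly) quotation marks, because the opening and closing marks are different characters. The sketch below is an illustration under that assumption and is not taken from either answer; the names CURLY_TO_LATEX and curly_to_latex are hypothetical.

```python
# Assumed mapping from Unicode directional quotes to LaTeX input conventions.
CURLY_TO_LATEX = str.maketrans({
    "\u201c": "``",  # left double quotation mark
    "\u201d": "''",  # right double quotation mark
    "\u2018": "`",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (also used as an apostrophe)
})


def curly_to_latex(text):
    """Translate curly quotes to LaTeX quotes; other characters pass through unchanged."""
    return text.translate(CURLY_TO_LATEX)


if __name__ == "__main__":
    # Nested example: a single-quoted phrase inside a double-quoted one.
    print(curly_to_latex("\u201cbob said: \u2018hi\u2019\u201d"))
```

Straight quotes are untouched by this mapping, so it could be combined with one of the approaches above for files that mix both styles.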