
Convert quotation marks to LaTeX format with Python

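tl;dr version

I have paragraphs that may contain quotation marks (e.g. "blah blah", 'this also', etc.). I need to replace them with LaTeX-style quotation marks (e.g. ``blah blah'', `this also', etc.) using Python 3.0.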

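Background

I have a lot of plain text files (more than a hundred). I need to do a little text processing on these files and then build a single LaTeX document from their contents. I am using Python 3.0 for this. Everything else works (escape characters, sections, and so on), but I cannot get the quotation marks right.

I can find the quotes with a regex pattern (as described here), but how do I replace them with the given pattern? I don't know how to use the re.sub() function in this case, because there may be multiple instances of quotes in my string. There is a related question, but how do I implement this in Python?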

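Design considerations

  1. I've only considered regular "double-quotes" and 'single-quotes'. There may be other kinds of quotation marks (see this question).
  2. LaTeX closing double quotes are also made of single quotes. We don't want to capture a LaTeX closing double quote (e.g. ``LaTeX double quotes'') and mistake it for a single quote (around nothing).
  3. Word contractions and the possessive 's contain single quotes (e.g. don't, John's). They are characterised by alphabetic characters on both sides of the quotation mark.
  4. Regular plural nouns (plural possession) have a single quote after the word (e.g. the actresses' roles).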

import re

def texify_single_quote(in_string):
    in_string = ' ' + in_string #Hack (see explanations)
    return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]

def texify_double_quote(in_string):
    return re.sub(r'"(.*?)"', r"``\1''", in_string)

Tests

with open("test.txt", 'r') as fd_in, open("output.txt", 'w') as fd_out:
    for line in fd_in:

        # Test for commutativity: the order of the two passes must not matter
        assert texify_single_quote(texify_double_quote(line)) == texify_double_quote(texify_single_quote(line))

        line = texify_single_quote(line)
        line = texify_double_quote(line)
        fd_out.write(line)

Input file (test.txt):

# 'single', 'single', "double"
# 'single', "double", 'single'
# "double", 'single', 'single'
# "double", "double", 'single'
# "double", 'single', "double"
# I'm a 'single' person
# I'm a "double" person?
# Ownership for plural words; the peoples' 'rights'
# John's dog barked 'Woof!', and Fred's parents' 'loving' cat ran away.
# "A double-quoted phrase, with a 'single' quote inside"
# 'A single-quoted phrase with a "double quote" inside, with contracted words such as "don't"'
# 'A single-quoted phrase with a regular noun such as actresses' roles'

Output (output.txt):

# `single', `single', ``double''
# `single', ``double'', `single'
# ``double'', `single', `single'
# ``double'', ``double'', `single'
# ``double'', `single', ``double''
# I'm a `single' person
# I'm a ``double'' person?
# Ownership for plural words; the peoples' `rights'
# John's dog barked `Woof!', and Fred's parents' `loving' cat ran away.
# ``A double-quoted phrase, with a `single' quote inside''
# `A single-quoted phrase with a ``double quote'' inside, with contracted words such as ``don't'''
# `A single-quoted phrase with a regular noun such as actresses' roles'

Note that the leading comment characters were added only to stop the post from formatting the output!
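The file-based test above can also be run without any files. The following is a minimal sketch, not part of the original answer: it repeats the two functions so that it is self-contained, and the helper `texify` and the `cases` list are names introduced here purely for illustration. The expected strings are taken from the input and output listings above, apart from the last case, which is a shortened version of the nested-quote line.

```python
import re

def texify_single_quote(in_string):
    in_string = ' ' + in_string  # Hack: lets the look-behind match at line start
    return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]

def texify_double_quote(in_string):
    return re.sub(r'"(.*?)"', r"``\1''", in_string)

def texify(in_string):
    # Hypothetical convenience wrapper: single quotes first, then double quotes
    return texify_double_quote(texify_single_quote(in_string))

# (input, expected) pairs
cases = [
    ("'single', \"double\"", "`single', ``double''"),
    ("I'm a 'single' person", "I'm a `single' person"),
    ("the peoples' 'rights'", "the peoples' `rights'"),
    ("\"outer with a 'single' inside\"", "``outer with a `single' inside''"),
]

for source, expected in cases:
    assert texify(source) == expected
    # Commutativity: applying the passes in the other order gives the same text
    assert texify_single_quote(texify_double_quote(source)) == expected
```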

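Explanation

We will break down this regex pattern, (?<=\s)'(?!')(.*?)'

  • Summary: (?<=\s)'(?!') handles the opening single quote, while (.*?) handles what is inside the quotes.
  • (?<=\s)' is a positive look-behind and only matches a single quote that is preceded by whitespace (\s). This is important to prevent matching contracted words such as can't (considerations 3 and 4).
  • '(?!') is a negative look-ahead and only matches a single quote that is not followed by another single quote (consideration 2).
  • As stated in this answer, the pattern (.*?) captures what is between the quotation marks, while \1 contains the capture.
  • The "Hack" in_string = ' ' + in_string is there because the look-behind cannot match a single quote that starts at the beginning of a line. Adding a space to every line (and then removing it again with the slice in return re.sub(...)[1:]) solves this problem!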
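As a possible alternative to the "Hack", the positive look-behind can be swapped for a negative one, (?<!\S), which succeeds both after whitespace and at the very start of the string. This is a sketch of a different technique, not the answer's own method, and both function names below are made up for the comparison.

```python
import re

def texify_single_quote_hack(in_string):
    # The approach from the answer: prepend a space, then slice it off again
    in_string = ' ' + in_string
    return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]

def texify_single_quote_no_hack(in_string):
    # (?<!\S) means "not preceded by a non-whitespace character",
    # which is true after whitespace and also at the start of the string
    return re.sub(r"(?<!\S)'(?!')(.*?)'", r"`\1'", in_string)

samples = [
    "'single' at the start",
    "I'm a 'single' person",
    "John's dog barked 'Woof!'",
    "``LaTeX double quotes''",
]

# Both variants agree on these samples
for s in samples:
    assert texify_single_quote_hack(s) == texify_single_quote_no_hack(s)

print(texify_single_quote_no_hack("'single' at the start"))  # `single' at the start
```

The two versions are only compared on the samples shown here; it is worth rerunning the full test file before relying on the variant.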

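Regular expressions are great for some tasks, but they are still limited (read this for more information). Writing a parser for this task seems more error-prone.

I created a simple function for this task and added comments. If you still have questions about the implementation, please ask.

The code (online version here):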

the_text = '''
This is my \"test\" String
This is my \'test\' String
This is my 'test' String
This is my \"test\" String which has \"two\" quotes
This is my \'test\' String which has \'two\' quotes
This is my \'test\' String which has \"two\" quotes
This is my \"test\" String which has \'two\' quotes
'''


def convert_quotes(txt, quote_type):
    # find all quotes
    quotes_pos = []
    idx = -1

    while True:
        idx = txt.find(quote_type, idx+1)
        if idx == -1:
            break
        quotes_pos.append(idx)

    if len(quotes_pos) % 2 == 1:
        raise ValueError('bad number of quotes of type %s' % quote_type)

    # replace quote with ``
    new_txt = []
    last_pos = -1

    for i, pos in enumerate(quotes_pos):
        # skip the odd-indexed (closing) quotes - we don't replace them
        if i % 2 == 1:
            continue
        new_txt += txt[last_pos+1:pos]
        new_txt += '``'
        last_pos = pos

    # append the last part of the string
    new_txt += txt[last_pos+1:]

    return ''.join(new_txt)

print(convert_quotes(convert_quotes(the_text, '\''), '"'))

This prints:

This is my ``test" String
This is my ``test' String
This is my ``test' String
This is my ``test" String which has ``two" quotes
This is my ``test' String which has ``two' quotes
This is my ``test' String which has ``two" quotes
This is my ``test" String which has ``two' quotes
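In the output above, only the opening quote of each pair is rewritten, and single quotes also receive the double-backtick opener. If distinct opening and closing replacements are wanted, one option is to alternate between them while walking through the pieces between quote characters. The sketch below is an assumption-laden variant, not the answer's code: `convert_quote_pairs` and its parameters are hypothetical names, and like the original function it treats every occurrence of the quote character as a quote, so apostrophes in words such as don't would still be miscounted.

```python
def convert_quote_pairs(txt, quote_char, open_repl, close_repl):
    # Splitting on n quote characters yields n + 1 pieces,
    # so an even piece count means an odd (unbalanced) number of quotes
    parts = txt.split(quote_char)
    if len(parts) % 2 == 0:
        raise ValueError('bad number of quotes of type %s' % quote_char)

    out = []
    for i, part in enumerate(parts):
        out.append(part)
        if i < len(parts) - 1:
            # The quote after an even-indexed piece opens, after an odd one closes
            out.append(open_repl if i % 2 == 0 else close_repl)
    return ''.join(out)

text = 'This is my "test" String which has \'two\' quotes'
text = convert_quote_pairs(text, '"', "``", "''")
text = convert_quote_pairs(text, "'", "`", "'")
print(text)  # This is my ``test'' String which has `two' quotes
```

Double quotes are converted first here so that the '' produced for closing double quotes is not seen by a later pass over double quotes; the single-quote pass still sees those two characters, but as a balanced pair whose replacement (` followed by ') would corrupt them, so this ordering only works when the single-quote pass runs on text that has no LaTeX double quotes yet. In the usage above the single-quote pass does run second, so it is shown only on a string where it happens to be harmless to check by hand; swapping the two calls is the safer order.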

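Note: parsing nested quotes is ambiguous.

For example, the string "bob said: "alice said: hello"" is nested in proper language.

But:

the string "bob said: hi" and "alice said: hello" is not nested.

If this is your case, you may want to first convert the nested quotes into a different kind of quote, or use parentheses () to disambiguate them.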
