Python regex unexpectedly replacing Chinese characters

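I have a list of Chinese dictionary entries (based on cc-cedict), one per line, that mix Chinese and Latin characters in the following format:

(source.txt)

Traditional Simplified,pinyin,definition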

山牆 山牆,shan1 qiang2,gable

B型超聲 B型超聲, B xing2 chao1 sheng1,type-B ultrasound

I would like to put a comma between the traditional and simplified characters:

(desired result)

山牆,山牆,shan1 qiang2,gable

B型超聲,B型超聲, B xing2 chao1 sheng1,type-B ultrasound

After some experimenting on regex101, I came up with this pattern:

[A-z]*[\u4300-\u9fff]+(\s)[A-z]*[\u4300-\u9fff]+,

I tried to apply this pattern in Python with the following code:

import re
sourcepath = 'sourcefile.txt'
destpath = 'result.txt'
pattern = '[A-z]*[\u4300-\u9fff]+(\s)[A-z]*[\u4300-\u9fff]+,'

source = open(sourcepath, 'r').read()
dest = open(destpath, 'w')
result = re.sub(pattern, ',', source)
dest.write(result)
dest.close()

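But when I open result.txt, the result is not what I expected:

,shan1 qiang2,gable

, B xing2 chao1 sheng1,type-B ultrasound

I also tried the regex module with this pattern:

[A-z]*\p{Han}(\s)[A-z]*\p{Han}

But the result was the same.

I thought that putting the \s in parentheses would make it a capture group, so that only that whitespace would be replaced. But it looks like the Chinese characters are being replaced as well. Have I made a mistake in the regex, the code, or both? How should I change it to get the desired result?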

Your pattern should account for overlapping matches if you have an odd number of Chinese "words". Use a lookahead:

re.sub(r'(?i)[A-Z]*[\u4300-\u9fff]+(?=\s+[A-Z]*[\u4300-\u9fff]+)', r'\g<0>,', source)
                                   ^^^                         ^
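
Why the lookahead matters for an odd number of words can be sketched with a three-word line. The helper names and the sample string `一 二 三` are made up for illustration; the character range is the one from the question.

```python
import re

HAN = r'[\u4300-\u9fff]'  # same range as in the question

def comma_consuming(text):
    # Both words are consumed by one match, so the second word
    # is no longer available as the start of the next match.
    return re.sub(rf'({HAN}+)\s+({HAN}+)', r'\1,\2', text)

def comma_lookahead(text):
    # Only the first word is consumed; the following word is just
    # checked by the lookahead and can start a match of its own.
    return re.sub(rf'({HAN}+)(?=\s+{HAN}+)', r'\1,', text)

sample = '一 二 三'
print(comma_consuming(sample))  # 一,二 三
print(comma_lookahead(sample))  # 一, 二, 三
```

Note that the lookahead version leaves the original whitespace in place after the inserted comma.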

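Or emulate an atomic group: capture inside a positive lookahead, consume the captured text with a backreference, and then use a negative lookahead to check that a comma is not already there: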

re.sub(r'(?i)[A-Z]*(?=([\u4300-\u9fff]+))\1(?!,)', r'\g<0>,', source) 
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
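
As a rough illustration of what the atomic-group emulation buys, the sketch below compares it with a plain greedy class followed by `(?!,)`. The variable names and the short sample string are assumptions for the demo only.

```python
import re

sample = '山牆,gable'  # the word is already followed by a comma

# Plain greedy class: when (?!,) fails after the full word, the engine
# backtracks to a shorter run and matches just the first character.
plain = re.sub(r'[\u4300-\u9fff]+(?!,)', r'\g<0>,', sample)

# Emulated atomic group: a lookahead is not re-entered on backtracking,
# so \1 always stands for the whole run and cannot be shortened.
atomic = re.sub(r'(?=([\u4300-\u9fff]+))\1(?!,)', r'\g<0>,', sample)

print(plain)   # 山,牆,gable
print(atomic)  # 山牆,gable
```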

See the regex demo (and demo 2). Ignore the \x{} notation there; it is only used in the demos because I am using the PHP option.

See the IDEONE Python 3 demo:

import re
p = re.compile(r'[A-Z]*(?=([\u4300-\u9fff]+))\1(?!,)', re.IGNORECASE | re.U)
test_str = "山牆 山牆,shan1 qiang2,gable\nB型超聲 B型超聲, B xing2 chao1 sheng1,type-B ultrasound"
result = p.sub(r"\g<0>,", test_str)
print(result)
# => 山牆, 山牆,shan1 qiang2,gable
# => B型超聲, B型超聲, B xing2 chao1 sheng1,type-B ultrasound

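"I thought that putting the \s in parentheses would make it a capture group, so that only that whitespace would be replaced."

That is not how capture groups work. Everything that is matched still gets replaced, but with capture groups you can refer to parts of the match in the replacement.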
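
To make the difference concrete, here is a minimal sketch on the first sample line. The three-group pattern is an illustrative assumption, simplified from the one in the question.

```python
import re

line = '山牆 山牆,shan1 qiang2,gable'
pattern = r'([\u4300-\u9fff]+)(\s)([\u4300-\u9fff]+)'

# A plain replacement string overwrites the whole match, groups included.
whole_match_replaced = re.sub(pattern, ',', line)

# Referring to groups 1 and 3 puts the matched words back around the comma.
groups_reinserted = re.sub(pattern, r'\1,\3', line)

print(whole_match_replaced)  # ,,shan1 qiang2,gable
print(groups_reinserted)     # 山牆,山牆,shan1 qiang2,gable
```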

I would change two lines of your script:

pattern = r'(?i)([a-z]*[\u4300-\u9fff]+)\s([a-z]*[\u4300-\u9fff]+)'

result = re.sub(pattern, r'\g<1>,\g<2>', source)

Tested on Python 3.5 with your sample code:

result = re.sub(r"([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)", r"\1,\2", subject, 0, re.IGNORECASE)

Regex explanation

([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)

Options: Case insensitive; Regex syntax only

Match the regex below and capture its match into backreference number 1 «([\u4e00-\u9fff]+)»
   Match a single character in the range between these two characters «[\u4e00-\u9fff]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      The character “一” which occupies Unicode code point U+4E00 «\u4e00»
      The Unicode character with code point U+9FFF «\u9fff»
Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line) «\s+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regular expression below «(?:[a-z]+)?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
   Match a single character in the range between “a” and “z” (case insensitive) «[a-z]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regex below and capture its match into backreference number 2 «([\u4e00-\u9fff]+)»
   Match a single character in the range between these two characters «[\u4e00-\u9fff]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      The character “一” which occupies Unicode code point U+4E00 «\u4e00»
      The Unicode character with code point U+9FFF «\u9fff»

\1,\2

Insert the text that was last matched by capturing group number 1 «\1»
Insert the character string “,” literally «,»
Insert the text that was last matched by capturing group number 2 «\2»
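
One thing worth checking with this pattern: the optional Latin prefix sits in a non-capturing group, so it is matched but not put back by \1,\2. The sketch below runs the pattern on the second sample line and compares it with a variant that keeps the prefix inside the capturing groups. The variant is an assumption for illustration, not part of the answer above.

```python
import re

line = 'B型超聲 B型超聲, B xing2 chao1 sheng1,type-B ultrasound'

# Pattern from the answer: the Latin prefix before the second word is
# matched by a non-capturing group, so \2 does not contain it.
original = re.sub(r'([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)',
                  r'\1,\2', line, flags=re.IGNORECASE)

# Hypothetical variant: the optional Latin prefix is part of each group.
variant = re.sub(r'([a-z]*[\u4e00-\u9fff]+)\s+([a-z]*[\u4e00-\u9fff]+)',
                 r'\1,\2', line, flags=re.IGNORECASE)

print(original)  # B型超聲,型超聲, B xing2 chao1 sheng1,type-B ultrasound
print(variant)   # B型超聲,B型超聲, B xing2 chao1 sheng1,type-B ultrasound
```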
