Python regex unexpectedly replaces Chinese characters

Question

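I have a list of Chinese dictionary entries (based on cc-cedict), one per line, that mix Chinese and Latin characters in the following format:

(source.txt)

Traditional Simplified,pinyin,definition

山牆 山牆,shan1 qiang2,gable

B型超聲 B型超聲, B xing2 chao1 sheng1,type-B ultrasound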

I want to put a comma between the traditional and the simplified characters:

(desired result)

山牆,山牆,shan1 qiang2,gable

B型超聲,B型超聲, B xing2 chao1 sheng1,type-B ultrasound

After some experimenting on regex101, I came up with this pattern:

[A-z]*[\u4300-\u9fff]+(\s)[A-z]*[\u4300-\u9fff]+,

I tried to apply this pattern in Python with the following code:

import re
sourcepath = 'sourcefile.txt'
destpath = 'result.txt'
pattern = '[A-z]*[\u4300-\u9fff]+(\s)[A-z]*[\u4300-\u9fff]+,'

source = open(sourcepath, 'r').read()
dest = open(destpath, 'w')
result = re.sub(pattern, ',', source)
dest.write(result)
dest.close()

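But when I open result.txt, the result is not what I expected:

,shan1 qiang2,gable

, B xing2 chao1 sheng1,type-B ultrasound

I also tried the regex module with this pattern:

[A-z]*\p{Han}(\s)[A-z]*\p{Han}

But the result was the same.

I thought that by putting the \s character in parentheses it would form a capture group, and only that whitespace would be replaced. But it looks like the Chinese characters are replaced as well. Did I make a mistake in the regex, in the code, or in both? How should I change it to get the desired result?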

Answer 1

If you have an odd number of Chinese "words", your pattern has to account for overlapping matches. Use a lookahead:

re.sub(r'(?i)[A-Z]*[\u4300-\u9fff]+(?=\s+[A-Z]*[\u4300-\u9fff]+)', r'\g<0>,', source)
                                   ^^^                         ^
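To see why the lookahead matters, here is a minimal sketch that compares a consuming pattern with the lookahead version on a line containing three Chinese words. The sample string and the variable names are made up for illustration and do not come from the question.

```python
import re

# Hypothetical sample with an odd number of Chinese "words".
sample = "山 水 火,pinyin"

han = r"[\u4300-\u9fff]+"

# Consuming both words: the second word is used up by the first match,
# so it cannot start a new match together with the third word.
consuming = re.sub(rf"({han})\s+({han})", r"\1,\2", sample)

# Lookahead: only the first word is consumed, so every word that is
# followed by another word gets its own match.
lookahead = re.sub(rf"{han}(?=\s+{han})", r"\g<0>,", sample)

print(consuming)  # 山,水 火,pinyin
print(lookahead)  # 山, 水, 火,pinyin
```

Because the lookahead does not consume the whitespace, the space stays in the output after the inserted comma.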

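Alternatively, emulate an atomic group: capture inside a positive lookahead, consume the text with a backreference to that capture, and add a negative lookahead that checks that a comma is not already present: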

re.sub(r'(?i)[A-Z]*(?=([\u4300-\u9fff]+))\1(?!,)', r'\g<0>,', source) 
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
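The reason for the lookahead-plus-backreference construct can be shown with a small comparison. This is only a sketch on one sample line from the question; the variable names are illustrative.

```python
import re

line = "山牆,shan1 qiang2,gable"  # a comma already follows the word

# Plain greedy run: when (?!,) fails after 山牆, the engine backtracks
# to 山, which is followed by 牆 rather than a comma, and matches there.
naive = re.sub(r"[\u4300-\u9fff]+(?!,)", r"\g<0>,", line)

# Lookahead capture plus backreference: the captured run cannot be
# shortened by backtracking, so the attempt fails and nothing changes.
atomic = re.sub(r"(?=([\u4300-\u9fff]+))\1(?!,)", r"\g<0>,", line)

print(naive)   # 山,牆,shan1 qiang2,gable
print(atomic)  # 山牆,shan1 qiang2,gable
```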

See the regex demo (and demo 2). Ignore the \x{} notation there; it is only used in the demo because I selected the PHP option.

See the IDEONE Python 3 demo:

import re
p = re.compile(r'[A-Z]*(?=([\u4300-\u9fff]+))\1(?!,)', re.IGNORECASE | re.U)
test_str = "山牆 山牆,shan1 qiang2,gable\nB型超聲 B型超聲, B xing2 chao1 sheng1,type-B ultrasound"
result = p.sub(r"\g<0>,", test_str)
print(result)
# => 山牆, 山牆,shan1 qiang2,gable
# => B型超聲, B型超聲, B xing2 chao1 sheng1,type-B ultrasound

Answer 2

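"I thought that by putting the \s character in parentheses it would form a capture group, and only that whitespace would be replaced."

That is not how capture groups work. Everything that is matched still gets replaced, but capture groups let you refer to parts of the match in the replacement.

I would change two lines of your script: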
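A short sketch of that difference, using a simplified pattern of my own (three capture groups, no Latin prefix) on a sample line from the question:

```python
import re

line = "山牆 山牆,shan1 qiang2,gable"

# Simplified illustrative pattern: word, whitespace, word.
pattern = r"([\u4300-\u9fff]+)(\s)([\u4300-\u9fff]+)"

# The replacement stands in for the whole match, not just group 2.
whole = re.sub(pattern, ",", line)

# Backreferences put the captured words back around the comma.
kept = re.sub(pattern, r"\1,\3", line)

print(whole)  # ,,shan1 qiang2,gable
print(kept)   # 山牆,山牆,shan1 qiang2,gable
```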

pattern = r'(?i)([a-z]*[\u4300-\u9fff]+)\s([a-z]*[\u4300-\u9fff]+)'

and

result = re.sub(pattern, r'\g<1>,\g<2>', source)
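Put together, the approach could look roughly like the sketch below, with the replacement referring to capture groups 1 and 2. The function names `add_commas` and `convert_file` are hypothetical, and UTF-8 file encoding is an assumption about the dictionary file.

```python
import re
from pathlib import Path

# Two words separated by one whitespace character; an optional Latin
# prefix (such as "B") is kept inside each capture group.
PATTERN = re.compile(r"(?i)([a-z]*[\u4300-\u9fff]+)\s([a-z]*[\u4300-\u9fff]+)")


def add_commas(text: str) -> str:
    """Replace the whitespace between the two words with a comma."""
    return PATTERN.sub(r"\g<1>,\g<2>", text)


def convert_file(sourcepath: str, destpath: str) -> None:
    """Hypothetical helper: read, convert, and write (UTF-8 assumed)."""
    text = Path(sourcepath).read_text(encoding="utf-8")
    Path(destpath).write_text(add_commas(text), encoding="utf-8")


sample = ("山牆 山牆,shan1 qiang2,gable\n"
          "B型超聲 B型超聲, B xing2 chao1 sheng1,type-B ultrasound")
converted = add_commas(sample)
print(converted)
```

Note that \s also matches a newline, so this sketch assumes that a line never ends in Chinese characters directly before a line that starts with them.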

Answer 3

Tested with your example code on Python 3.5:

result = re.sub(r"([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)", r"\1,\2", subject, 0, re.IGNORECASE)

Regex explanation

([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)

Options: Case insensitive; Regex syntax only

Match the regex below and capture its match into backreference number 1 «([\u4e00-\u9fff]+)»
   Match a single character in the range between these two characters «[\u4e00-\u9fff]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      The character “一” which occupies Unicode code point U+4E00 «\u4e00»
      The Unicode character with code point U+9FFF «\u9fff»
Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line) «\s+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regular expression below «(?:[a-z]+)?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
   Match a single character in the range between “a” and “z” (case insensitive) «[a-z]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regex below and capture its match into backreference number 2 «([\u4e00-\u9fff]+)»
   Match a single character in the range between these two characters «[\u4e00-\u9fff]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      The character “一” which occupies Unicode code point U+4E00 «\u4e00»
      The Unicode character with code point U+9FFF «\u9fff»

\1,\2

Insert the text that was last matched by capturing group number 1 «\1»
Insert the character string “,” literally «,»
Insert the text that was last matched by capturing group number 2 «\2»
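As a quick check of the explanation above, the sketch below runs the pattern on the second sample line. Since the optional Latin prefix sits in a non-capturing group, it is not part of \2 and is lost in the output. The variant that moves the prefix into both capture groups is my own assumption about the intended behavior, not part of the answer.

```python
import re

line = "B型超聲 B型超聲, B xing2 chao1 sheng1,type-B ultrasound"

# Pattern from the answer: the Latin prefix before the second word is
# matched by (?:[a-z]+)? but not captured, so it is dropped.
original = re.sub(r"([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)",
                  r"\1,\2", line, flags=re.IGNORECASE)

# Assumed variant: keep the Latin prefix inside both capture groups.
variant = re.sub(r"([a-z]*[\u4e00-\u9fff]+)\s+([a-z]*[\u4e00-\u9fff]+)",
                 r"\1,\2", line, flags=re.IGNORECASE)

print(original)  # B型超聲,型超聲, B xing2 chao1 sheng1,type-B ultrasound
print(variant)   # B型超聲,B型超聲, B xing2 chao1 sheng1,type-B ultrasound
```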

3 answers

Answer 1
1 accepted 2016-05-02 11:48:08

Answer 2
0 2016-05-02 11:25:55

Answer 3
0 2016-05-02 11:36:07
