Python regex unexpectedly replacing Chinese characters
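I have a list of Chinese dictionary entries (based on cc-cedict) that mix Chinese and Latin characters in the following format, one entry per line:
(source.txt)
traditional simplified,pinyin,definition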
山牆 山牆,shan1 qiang2,gable
B型超聲 B型超聲, B xing2 chao1 sheng1,type-B ultrasound
I want to add a comma between the traditional and simplified characters:
(desired result)
山牆,山牆,shan1 qiang2,gable
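B型超聲,B型超聲, B xing2 chao1 sheng1,type-B ultrasound
After some experimenting on regex101, I came up with this pattern:
[A-z]*[\u4300-\u9fff]+(\s)[A-z]*[\u4300-\u9fff]+,
I tried to apply this pattern in Python with the following code: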
import re
sourcepath = 'sourcefile.txt'
destpath = 'result.txt'
pattern = '[A-z]*[\u4300-\u9fff]+(\s)[A-z]*[\u4300-\u9fff]+,'
source = open(sourcepath, 'r').read()
dest = open(destpath, 'w')
result = re.sub(pattern, ',', source)
dest.write(result)
dest.close()
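But when I open result.txt, the result is not what I expected:
,shan1 qiang2,gable
, B xing2 chao1 sheng1,type-B ultrasound
I also tried the regex module with this pattern:
[A-z]*\p{Han}(\s)[A-z]*\p{Han}
But the result was the same.
I thought that by putting the \s character in parentheses it would form a capture group and only that whitespace would be replaced. But it looks like the Chinese characters are being replaced as well. Did I make a mistake in the regex, the code, or both? How should I change it to get the desired result?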
If you have an odd number of Chinese "words", your pattern needs to account for overlapping matches. Use a lookahead:
re.sub(r'(?i)[A-Z]*[\u4300-\u9fff]+(?=\s+[A-Z]*[\u4300-\u9fff]+)', r'\g<0>,', source)
                                   ^^^                         ^
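To see why the lookahead matters, here is a minimal sketch comparing it with a pattern that consumes both words. The three-word sample line and the `CONSUMING` pattern are made up for illustration; only `LOOKAHEAD` comes from the answer.

```python
import re

# Pattern from the answer: the second word is only asserted by the lookahead,
# never consumed, so it can also be the start of the next match.
LOOKAHEAD = re.compile(r'(?i)[A-Z]*[\u4300-\u9fff]+(?=\s+[A-Z]*[\u4300-\u9fff]+)')

# Hypothetical comparison pattern that consumes both words in a single match.
CONSUMING = re.compile(r'(?i)([A-Z]*[\u4300-\u9fff]+)\s+([A-Z]*[\u4300-\u9fff]+)')

# Made-up line with three Chinese "words" in a row (an odd number).
sample = "山牆 山牆 山牆,shan1 qiang2,gable"

with_lookahead = LOOKAHEAD.sub(r'\g<0>,', sample)
with_consuming = CONSUMING.sub(r'\1,\2', sample)

print(with_lookahead)  # 山牆, 山牆, 山牆,shan1 qiang2,gable
print(with_consuming)  # 山牆,山牆 山牆,shan1 qiang2,gable
```

The consuming pattern uses up the second word in its first match, so the gap before the third word is never found. Note that the lookahead version leaves the original space in place after the inserted comma, because only the first word is part of the match.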
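Or emulate an atomic group: capture inside a positive lookahead, consume the captured text with a backreference, and add a negative lookahead that checks whether a comma is already present: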
re.sub(r'(?i)[A-Z]*(?=([\u4300-\u9fff]+))\1(?!,)', r'\g<0>,', source)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
See the regex demo (and demo 2). Ignore the \x{} notation there; it is used only because the demos run with the PHP option.
import re
p = re.compile(r'[A-Z]*(?=([\u4300-\u9fff]+))\1(?!,)', re.IGNORECASE | re.U)
test_str = "山牆 山牆,shan1 qiang2,gable\nB型超聲 B型超聲, B xing2 chao1 sheng1,type-B ultrasound"
result = p.sub(r"\g<0>,", test_str)
print(result)
# => 山牆, 山牆,shan1 qiang2,gable
# => B型超聲, B型超聲, B xing2 chao1 sheng1,type-B ultrasound
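A small sketch of what the atomic-group emulation buys you. The `naive` pattern below is an assumption added for comparison and is not from the answer; it shows how a plain greedy run can backtrack into the middle of a word so that `(?!,)` succeeds.

```python
import re

text = "山牆 山牆,shan1 qiang2,gable"

# Naive comparison pattern (hypothetical): the greedy run may give back
# characters, so "山牆" before a comma still matches as the shorter "山".
naive = re.compile(r'[\u4300-\u9fff]+(?!,)')

# Emulated atomic group: the lookahead captures the full run once, and \1
# must consume exactly that text, so no shorter run is ever tried.
atomic = re.compile(r'(?=([\u4300-\u9fff]+))\1(?!,)')

naive_result = naive.sub(r'\g<0>,', text)
atomic_result = atomic.sub(r'\g<0>,', text)

print(naive_result)   # 山牆, 山,牆,shan1 qiang2,gable
print(atomic_result)  # 山牆, 山牆,shan1 qiang2,gable
```

On Python 3.11 and later, the `re` module also accepts native atomic groups written as `(?>...)`, which should make the lookahead-plus-backreference trick unnecessary there.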
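"I thought that by putting the \s character in parentheses it would form a capture group and only that whitespace would be replaced."
That is not how capture groups work. Everything that matches is still replaced, but with capture groups you can refer to parts of the match in the replacement.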
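The difference between the whole match and a capture group can be checked directly. This is a minimal sketch; the pattern is a simplified, hypothetical version of the one in the question (the Latin prefix is left out to keep it short).

```python
import re

line = "山牆 山牆,shan1 qiang2,gable"

# Simplified illustrative pattern: word, whitespace, word, comma.
pattern = re.compile(r'([\u4300-\u9fff]+)(\s)([\u4300-\u9fff]+),')

m = pattern.search(line)
whole = m.group(0)  # the entire match: this is what re.sub replaces
space = m.group(2)  # the captured whitespace: only a part of the match

# Replacing with a bare comma throws away everything that matched.
replaced_all = pattern.sub(',', line)

# Referring to the groups in the replacement puts the words back.
kept_words = pattern.sub(r'\1,\3,', line)

print(repr(whole))   # '山牆 山牆,'
print(replaced_all)  # ,shan1 qiang2,gable
print(kept_words)    # 山牆,山牆,shan1 qiang2,gable
```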
I would change two lines of the script:
pattern = r'(?i)([a-z]*[\u4300-\u9fff]+)\s([a-z]*[\u4300-\u9fff]+)'
and
result = re.sub(pattern, r'\g<1>,\g<2>', source)
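Put together, the script could look roughly like the sketch below. The pattern is the one from this answer written as a raw string, with the replacement referring to groups 1 and 2. The function names `add_commas` and `convert_file` are hypothetical, and `encoding='utf-8'` is an assumption about how the dictionary file is stored.

```python
import re

# Pattern from the answer, as a raw string so \s and \u reach the regex engine.
PATTERN = re.compile(r'(?i)([a-z]*[\u4300-\u9fff]+)\s([a-z]*[\u4300-\u9fff]+)')


def add_commas(text):
    """Replace the whitespace between the two Chinese forms with a comma."""
    return PATTERN.sub(r'\g<1>,\g<2>', text)


def convert_file(sourcepath, destpath):
    """Read sourcepath, add the commas, and write the result to destpath."""
    # Assumption: the file is UTF-8 encoded.
    with open(sourcepath, 'r', encoding='utf-8') as src:
        converted = add_commas(src.read())
    with open(destpath, 'w', encoding='utf-8') as dest:
        dest.write(converted)


sample = ("山牆 山牆,shan1 qiang2,gable\n"
          "B型超聲 B型超聲, B xing2 chao1 sheng1,type-B ultrasound")
print(add_commas(sample))
# 山牆,山牆,shan1 qiang2,gable
# B型超聲,B型超聲, B xing2 chao1 sheng1,type-B ultrasound
```

Calling `convert_file('sourcefile.txt', 'result.txt')` would then mirror the original script, with the `with` blocks closing both files even if an error occurs.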
Tested on Python 3.5 with your sample code:
result = re.sub(r"([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)", r"\1,\2", subject, count=0, flags=re.IGNORECASE)
Regex explanation
([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)
Options: Case insensitive; Regex syntax only
Match the regex below and capture its match into backreference number 1 «([\u4e00-\u9fff]+)»
   Match a single character in the range between these two characters «[\u4e00-\u9fff]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      The character “一” which occupies Unicode code point U+4E00 «\u4e00»
      The Unicode character with code point U+9FFF «\u9fff»
Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line) «\s+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regular expression below «(?:[a-z]+)?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
   Match a single character in the range between “a” and “z” (case insensitive) «[a-z]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regex below and capture its match into backreference number 2 «([\u4e00-\u9fff]+)»
   Match a single character in the range between these two characters «[\u4e00-\u9fff]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      The character “一” which occupies Unicode code point U+4E00 «\u4e00»
      The Unicode character with code point U+9FFF «\u9fff»
\1,\2
   Insert the text that was last matched by capturing group number 1 «\1»
   Insert the character string “,” literally «,»
   Insert the text that was last matched by capturing group number 2 «\2»
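One detail of the breakdown above is worth checking on the second sample line: the optional Latin prefix sits in a non-capturing group, so it is matched but not reinserted by \1,\2. The sketch below shows that behaviour next to a hypothetical variant (not part of the answer) that moves the prefix inside group 2.

```python
import re

line = "B型超聲 B型超聲, B xing2 chao1 sheng1,type-B ultrasound"

# Pattern as explained above: (?:[a-z]+)? matches the "B" before the second
# word, but the replacement only reinserts groups 1 and 2.
original = re.compile(r'([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)',
                      re.IGNORECASE)

# Hypothetical variant: the optional Latin prefix is captured as part of group 2.
variant = re.compile(r'([\u4e00-\u9fff]+)\s+([a-z]*[\u4e00-\u9fff]+)',
                     re.IGNORECASE)

original_result = original.sub(r'\1,\2', line)
variant_result = variant.sub(r'\1,\2', line)

print(original_result)  # B型超聲,型超聲, B xing2 chao1 sheng1,type-B ultrasound
print(variant_result)   # B型超聲,B型超聲, B xing2 chao1 sheng1,type-B ultrasound
```

For entries without a Latin prefix, such as 山牆 山牆, both patterns should give the same result.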