Regex to append some characters in a certain position
I have a txt file that looks like this:
abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;
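I want to modify it with a regular expression, because it is a very long txt file. I want to append the first word of each line after the first ";" and before each CAT(...). There should also be a second ";" after the appended word and before CAT. How can I do that?
So my output would be: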
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
You can try the following find and replace, in regex mode:
Find: ^([^(]+)(.*?;)(CAT.*)$
Replace: $1$2$1;$3
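The idea here is to split each line into the pieces we need in order to stitch the replacement together. In this case, the first capture group is the word we plan to insert after the first semicolon and before CAT.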
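For readers applying this outside an editor, the same find and replace can be sketched in Python. This is a minimal sketch, assuming the whole text is held in one multi-line string; the names `text`, `pattern` and `result` are illustrative and not from the original post. `re.MULTILINE` makes `^` and `$` match at every line boundary, which is how an editor's regex mode typically behaves.

```python
import re

text = (
    "abadan(iof>city>thing);CAT(CATN),N(NP) ;\n"
    "abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"
)

# Group 1: the headword (everything before the first "(").
# Group 2: the shortest run that ends in ";" and is directly followed by "CAT".
# Group 3: "CAT" and the rest of the line.
pattern = re.compile(r'^([^(]+)(.*?;)(CAT.*)$', re.MULTILINE)

# Python writes backreferences as \1 where the editor syntax uses $1.
result = pattern.sub(r'\1\2\1;\3', text)
print(result)
```

Note that `[^(]+` can also match a newline, so this sketch assumes every line contains an opening parenthesis.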
Just noticed that you are using Python. We can try:
import re
inp = """aarhus(iof>city>thing,equ>arhus);CAT(CATN),N(NP) ;
abadan(iof>city>thing);CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"""
# The trailing \s* consumes each line break, so the replacement adds "\n" back
output = re.sub(r'([^(]+)(.*?;)(CAT.*?;)\s*', '\\1\\2\\1;\\3\n', inp)
print(output)
This prints:
aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
In Python, you can do the following:
import re
test_strings = [
    'aarhus(iof>city>thing,equ>arhus);CAT(CATN),N(NP) ;',
    'abadan(iof>city>thing);CAT(CATN),N(NP) ;',
    'abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;'
]
# the first group matches the word that you want to repeat, then you capture the rest
# until the ;CAT, which you capture separately
regex = r'(\w+)(.*)(;CAT.*)'
new_strings = []
for test_string in test_strings:
    match = re.match(regex, test_string)
    new_string = match.group(1) + match.group(2) + ";" + match.group(1) + match.group(3)
    new_strings.append(new_string)
    print(new_string)
Gives you:
aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
Your strings are stored in the new_strings list.
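One caveat: `re.match` returns `None` for a line that does not fit the pattern, and `match.group(1)` then raises `AttributeError`. A minimal sketch of a guard, assuming non-matching lines should pass through unchanged (that policy and the helper name `insert_headword` are assumptions, not part of the answer):

```python
import re

REGEX = re.compile(r'(\w+)(.*)(;CAT.*)')

def insert_headword(line):
    """Repeat the first word before ';CAT'; return the line unchanged if it does not match."""
    match = REGEX.match(line)
    if match is None:
        # Hypothetical policy: leave unexpected lines alone instead of failing.
        return line
    return match.group(1) + match.group(2) + ";" + match.group(1) + match.group(3)

print(insert_headword('abadan(iof>city>thing);CAT(CATN),N(NP) ;'))
print(insert_headword('# a comment line without the expected shape'))
```

Raising an error or logging the line would be equally reasonable choices, depending on how clean the input file is.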
Edit: to read the file in as a list of strings ready for modification, just use a with open statement and call readlines():
my_file = 'my_text_file.txt'
with open(my_file, 'r') as f:
    my_file_as_list = f.readlines()
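Note that `readlines()` keeps the trailing newline on each line, and since `.` does not match a newline, a string rebuilt from the match groups would lose it. A minimal sketch of the full round trip, assuming the `(\w+)(.*)(;CAT.*)` pattern from this answer; the temporary directory, the output file name and the sample content are placeholders so that the example runs on its own:

```python
import os
import re
import tempfile

regex = re.compile(r'(\w+)(.*)(;CAT.*)')

# Create a small stand-in input file; in practice this would be the real txt file.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'my_text_file.txt')
out_path = os.path.join(workdir, 'my_text_file_out.txt')
with open(in_path, 'w', encoding='utf-8') as f:
    f.write('aarhus(iof>city>thing,equ>arhus);CAT(CATN),N(NP) ;\n')
    f.write('abadan(iof>city>thing);CAT(CATN),N(NP) ;\n')

with open(in_path, 'r', encoding='utf-8') as f:
    my_file_as_list = f.readlines()

new_strings = []
for raw in my_file_as_list:
    line = raw.rstrip('\n')  # drop the newline kept by readlines()
    match = regex.match(line)
    if match:
        line = match.group(1) + match.group(2) + ';' + match.group(1) + match.group(3)
    new_strings.append(line)

# Write the modified lines back, restoring one newline per line.
with open(out_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(new_strings) + '\n')

with open(out_path, encoding='utf-8') as f:
    print(f.read(), end='')
```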
Matching the different groups and stitching them together may be faster than a regex replace. That would have to be tested.
import re
#=== DESIRED ===================================================================
# aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
# abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
# abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
#===============================================================================
data = ["abadan(iof>city>thing);CAT(CATN),N(NP) ;",
        "abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"]
# Matching different groups and then stitching them together may be faster than a regex replace.
# Based on https://stackoverflow.com/questions/3850074/regex-until-but-not-including
# (?:(?!CAT).)* - match anything until the start of the word CAT.
# I.e.
# (?:        # Match the following but do not capture it:
#   (?!CAT)  # first assert that it's not possible to match "CAT" here,
#   .        # then match any character
# )*         # end of group, zero or more repetitions.
# Raw strings keep "\(" from being treated as an invalid string escape.
p = ''.join([r"^",                   # Match start of string
             r"(.*?(?:(?!\().)*)",   # Group 1: anything up to the first open paren, i.e. the first word (abadan or abandon)
             r"(.*?(?:(?!CAT).)*)",  # Group 2: everything after group 1, up to but not including "CAT"
             r"(.*$)"                # Group 3: the rest
             ])
for line in data:
    m = re.match(p, line)
    newline = m.group(1)          # First word
    newline += m.group(2)         # Group two
    newline += m.group(1) + ";"   # First word again, with a semicolon
    newline += m.group(3)         # Group three
    print(newline)
OUTPUT:
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
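The speed question is left open above. A minimal sketch of how it could be measured with `timeit`, assuming the two patterns below stand in for the approaches being compared; the function names, the simplified match pattern and the repetition counts are illustrative. No timings are claimed here, since they depend on the machine and the Python version.

```python
import re
import timeit

data = [
    "abadan(iof>city>thing);CAT(CATN),N(NP) ;",
    "abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;",
] * 1000

# Simplified stand-in for the match-and-stitch pattern (an assumption).
match_re = re.compile(r'^([^(]*)((?:(?!CAT).)*)(.*)$')
# Single-substitution pattern taken from the find/replace answer.
sub_re = re.compile(r'^([^(]+)(.*?;)(CAT.*)$')

def stitch(line):
    """Match three groups, then rebuild the line by concatenation."""
    m = match_re.match(line)
    return m.group(1) + m.group(2) + m.group(1) + ";" + m.group(3)

def substitute(line):
    """Rebuild the line with one re.sub call and backreferences."""
    return sub_re.sub(r'\1\2\1;\3', line)

# Only compare speed once both approaches agree on the result.
assert all(stitch(line) == substitute(line) for line in data[:2])

t_stitch = timeit.timeit(lambda: [stitch(line) for line in data], number=20)
t_sub = timeit.timeit(lambda: [substitute(line) for line in data], number=20)
print(f"match and stitch: {t_stitch:.3f} s")
print(f"re.sub:           {t_sub:.3f} s")
```

Both patterns are compiled once up front so that the measurement covers matching rather than pattern compilation.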
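This script reads the input file, performs the replacement and writes the output file:
import re
infile = 'input.txt'
outfile = 'outfile.txt'
# Open both files in one with statement so that they are closed (and the output flushed) on exit
with open(infile, 'r') as f, open(outfile, 'w') as o:
    for line in f:
        # Group 1 is the text before ";CAT", group 2 the leading word inside it;
        # write group 1 back, followed by ";" and the word
        o.write(re.sub(r'((\w+).+?)(?=;CAT)', r'\1;\2', line))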
cat outfile.txt
aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
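The substitution works because `(?=;CAT)` is a lookahead: it checks that `;CAT` follows without consuming it, so only the text in front of it is replaced and the rest of the line, including its newline, is left as it was. A small sketch on an in-memory string, assuming each line contains `;CAT` once; passing `count=1` is an added precaution and not part of the script:

```python
import re

line = 'abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;\n'

# Group 1: everything from the first word up to (not including) ";CAT".
# Group 2: the first word itself, nested inside group 1.
pattern = re.compile(r'((\w+).+?)(?=;CAT)')

# Re-emit group 1, then ";" and the first word; ";CAT..." stays untouched.
new_line = pattern.sub(r'\1;\2', line, count=1)
print(new_line, end='')
```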