I have a txt file which looks like this:
abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;
I want to modify it using regular expressions, since the file is really long. After the first ";" and before each CAT(...), I want to insert the first word of the line, followed by a second ";". How can I do it?
So my output should be:
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
You may try the following find and replace, in regex mode:
Find: ^([^(]+)(.*?;)(CAT.*)$
Replace: $1$2$1;$3
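The idea here is to subdivide each line into the pieces we need to stitch together in the replacement. The first capture group is the word we want to insert after the first semicolon, before CAT. The second group is everything from the opening parenthesis up to and including that semicolon, and the third group is the rest of the line, starting at CAT.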
Just noticed you are using Python. We can try:
import re
inp = """aarhus(iof>city>thing,equ>arhus);CAT(CATN),N(NP) ;
abadan(iof>city>thing);CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"""
output = re.sub(r'([^(]+)(.*?;)(CAT.*?;)\s*', '\\1\\2\\1;\\3\n', inp)
print(output)
This prints:
aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
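As a rough sketch, the anchored find/replace pattern from the start of this answer can also be used directly in Python, assuming the `re.MULTILINE` flag so that `^` and `$` match at each line. Printing the groups shows how every line is subdivided. The variable names here are only illustrative.

```python
import re

# Same pattern as the find/replace above; re.MULTILINE lets ^ and $ match per line.
pattern = re.compile(r'^([^(]+)(.*?;)(CAT.*)$', re.MULTILINE)

text = """abadan(iof>city>thing);CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"""

# Show the three pieces captured on each line: first word, middle part, CAT part.
for m in pattern.finditer(text):
    print(m.groups())

# Thread the pieces back together with the first word repeated before CAT.
result = pattern.sub(r'\1\2\1;\3', text)
print(result)
```

Because the pattern is anchored per line, no trailing whitespace handling is needed and the original line breaks are left as they are.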
In Python you can do this as follows:
import re
test_strings = [
'aarhus(iof>city>thing,equ>arhus);CAT(CATN),N(NP) ;',
'abadan(iof>city>thing);CAT(CATN),N(NP) ;',
'abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;'
]
# the first group matches the word that you want to repeat, the second captures
# the rest up to ";CAT", and the third captures ";CAT" and everything after it
regex = r'(\w+)(.*)(;CAT.*)'
new_strings = []
for test_string in test_strings:
    match = re.match(regex, test_string)
    # rebuild the line: word + middle + ";" + word again + ";CAT..."
    new_string = match.group(1) + match.group(2) + ";" + match.group(1) + match.group(3)
    new_strings.append(new_string)
    print(new_string)
Gives you:
aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
Your strings are then stored in the new_strings list.
EDIT: To read your file as a list of strings ready to be modified, use a with open statement and call readlines():
my_file = 'my_text_file.txt'
with open(my_file, 'r') as f:
    my_file_as_list = f.readlines()
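One thing to watch for when the lines come from readlines(): each string keeps its trailing newline, and a blank or malformed line makes re.match return None, so calling .group() on it raises an AttributeError. A minimal sketch of a guard follows. The helper name `add_first_word` and the sample lines are assumptions made for illustration, not part of the answer above.

```python
import re

# Same regex as in the answer above, compiled once.
LINE_RE = re.compile(r'(\w+)(.*)(;CAT.*)')

def add_first_word(line):
    """Hypothetical helper: insert the first word before ';CAT'.

    Lines that do not match the pattern are returned unchanged
    (minus the trailing newline).
    """
    stripped = line.rstrip('\n')
    match = LINE_RE.match(stripped)
    if match is None:
        return stripped
    return match.group(1) + match.group(2) + ';' + match.group(1) + match.group(3)

# Sample input shaped like the result of readlines(), including a blank line.
lines = ['abadan(iof>city>thing);CAT(CATN),N(NP) ;\n', '\n', 'no category here\n']
new_strings = [add_first_word(line) for line in lines]
print(new_strings)
```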
Matching the groups separately and stitching them together may be faster than a regex replace, though I have not tested this:
import re
#=== DESIRED ===================================================================
# aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
# abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
# abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
#===============================================================================
data = ["abadan(iof>city>thing);CAT(CATN),N(NP) ;",
"abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"]
# Matching different groups and then stitching them together may be faster than a regex replace.
# Based on https://stackoverflow.com/questions/3850074/regex-until-but-not-including
# (?:(?!CAT).)* - match anything until the start of the word CAT.
# I.e.
# (?: # Match the following but do not capture it:
# (?!CAT) # (first assert that it's not possible to match "CAT" here
# . # then match any character
# )* # end of group, zero or more repetitions.
# Raw strings are used so that "\(" reaches the regex engine as an escaped paren.
p = ''.join([r"^",                   # Match start of string
             r"(.*?(?:(?!\().)*)",   # Group 1: everything up to the first open paren, i.e. the first word (abadan or abandon)
             r"(.*?(?:(?!CAT).)*)",  # Group 2: everything after group 1, up to but not including "CAT"
             r"(.*$)"                # Group 3: the rest of the line
             ])
for line in data:
    m = re.match(p, line)
    newline = m.group(1)          # First word
    newline += m.group(2)         # Group two
    newline += m.group(1) + ";"   # First word again with semi-colon
    newline += m.group(3)         # Group three
    print(newline)
OUTPUT:
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
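Since the speed claim above is untested, here is one way it could be checked with timeit. This is only a sketch: the function names `stitch` and `substitute` are made up for the comparison, the repetition count is arbitrary, and the timings will vary by machine and Python version, so no result is claimed here.

```python
import re
import timeit

line = "abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"

# Same three-group pattern as above, compiled once and shared by both approaches.
pattern = re.compile(r"^(.*?(?:(?!\().)*)(.*?(?:(?!CAT).)*)(.*$)")

def stitch(text):
    # Match once, then concatenate the captured groups by hand.
    m = pattern.match(text)
    return m.group(1) + m.group(2) + m.group(1) + ";" + m.group(3)

def substitute(text):
    # Let re.sub build the result from backreferences.
    return pattern.sub(r"\1\2\1;\3", text)

# Both approaches must agree before timing them means anything.
assert stitch(line) == substitute(line)

n = 20000  # arbitrary repetition count
print("stitch:    ", timeit.timeit(lambda: stitch(line), number=n))
print("substitute:", timeit.timeit(lambda: substitute(line), number=n))
```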
This script reads the input file, performs the replacement, and writes the result to the output file:
import re
infile = 'input.txt'
outfile = 'outfile.txt'
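# the with statement closes both files even if an error occurs
with open(infile, 'r') as f, open(outfile, 'w') as o:
    for line in f:
        # \1 is everything before ";CAT", \2 is the first word inside it
        o.write(re.sub(r'((\w+).+?)(?=;CAT)', r'\1;\2', line))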
cat outfile.txt
aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;