Regex to append some characters in a certain position

I have a txt file which looks like this:

abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ; 

I want to modify it using regular expressions, since it's a really long txt. After the first ";" and before each CAT(...), I want to append the first word of each line. There should also be a second ";" after the appended word and before the CAT. How can I do it?

So my output will be:

abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;

You may try the following find and replace, in regex mode:

Find:    ^([^(]+)(.*?;)(CAT.*)$
Replace: $1$2$1;$3

The idea here is to subdivide each line into the pieces we need to thread together in the replacement. In this case, the first capture group is the word which we plan on inserting after the first semicolon, before CAT.


Just noticed you are using Python. We can try:

import re

inp = """aarhus(iof>city>thing,equ>arhus);CAT(CATN),N(NP) ;
abadan(iof>city>thing);CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"""
output = re.sub(r'([^(]+)(.*?;)(CAT.*?;)\s*', '\\1\\2\\1;\\3\n', inp)
print(output)

This prints:

aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
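
The Find/Replace pair above can also be kept line-anchored in Python by compiling it with re.MULTILINE. The following is only a minimal sketch: the group names (word, head, tail) and the function name insert_first_word are illustrative labels, and the first group excludes newlines, which is a small change from the original pattern so that a match cannot start on an earlier line.

```python
import re

# Same three pieces as the Find/Replace pair, with ^ and $ kept.
# Assumption: [^(\n]+ instead of [^(]+ so the first word never spans lines.
PATTERN = re.compile(
    r'^(?P<word>[^(\n]+)(?P<head>.*?;)(?P<tail>CAT.*)$',
    re.MULTILINE,
)

def insert_first_word(text):
    """Insert the first word of each matching line, plus ';', before CAT."""
    return PATTERN.sub(r'\g<word>\g<head>\g<word>;\g<tail>', text)

sample = "abadan(iof>city>thing);CAT(CATN),N(NP) ;\nno match here"
print(insert_first_word(sample))
```

Lines that do not fit the pattern are left as they are, and no newline handling is needed because the anchors work per line.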

In Python you can do this as follows:

import re

test_strings = [
    'aarhus(iof>city>thing,equ>arhus);CAT(CATN),N(NP) ;',
    'abadan(iof>city>thing);CAT(CATN),N(NP) ;',
    'abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;' 
]
# the first group matches the word that you want to repeat, the second captures
# the rest up to the ;CAT, which is captured separately in the third group
regex = r'(\w+)(.*)(;CAT.*)'

new_strings = []
for test_string in test_strings:
    match = re.match(regex, test_string)
    new_string = match.group(1) + match.group(2) + ";" + match.group(1) + match.group(3)
    new_strings.append(new_string)
    print(new_string)

Gives you:

aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;

And your strings are stored in the new_strings list.

EDIT: To read your file as a list of strings ready to be modified, just use a with open statement and call readlines():

my_file = 'my_text_file.txt'

with open(my_file, 'r') as f:
    my_file_as_list = f.readlines()
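
Putting the two pieces together, here is a hedged end-to-end sketch that reads a file, applies the regex above and writes the result. The helper names add_first_word and process_file and the output file name are assumptions made for illustration. One detail to watch: lines read from a file keep their trailing newline, and since '.' does not match a newline, joining the groups would silently drop it, so the sketch strips it first and adds it back.

```python
import re
import tempfile
from pathlib import Path

# Same pattern as in the answer above.
REGEX = re.compile(r'(\w+)(.*)(;CAT.*)')

def add_first_word(line):
    """Return the line with its first word inserted before ';CAT'.

    Lines that do not match are returned unchanged (an assumption about
    how unexpected lines should be handled).
    """
    match = REGEX.match(line)
    if match is None:
        return line
    return match.group(1) + match.group(2) + ";" + match.group(1) + match.group(3)

def process_file(in_path, out_path):
    """Read in_path line by line and write the modified lines to out_path."""
    with open(in_path, 'r') as src, open(out_path, 'w') as dst:
        for line in src:
            # Strip the newline first: '.' does not match it, so it would be lost.
            dst.write(add_first_word(line.rstrip('\n')) + '\n')

# Demo on a temporary file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    in_path = Path(tmp) / 'my_text_file.txt'
    out_path = Path(tmp) / 'my_text_file_out.txt'
    in_path.write_text('abadan(iof>city>thing);CAT(CATN),N(NP) ;\n')
    process_file(in_path, out_path)
    result = out_path.read_text()

print(result, end='')
```

Iterating over the file object instead of calling readlines() avoids loading a very long file into memory at once.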

Matching different groups and knitting them together may be faster than a regex replace. I would have to test that.

import re

#=== DESIRED ===================================================================
# aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
# abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
# abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
#===============================================================================

data = ["abadan(iof>city>thing);CAT(CATN),N(NP) ;", 
"abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"]

# Matching different groups and then stitching them together may be faster than a regex replace.
# Based on https://stackoverflow.com/questions/3850074/regex-until-but-not-including
# (?:(?!CAT).)* - match anything until the start of the word CAT.
# I.e.
# (?:        # Match the following but do not capture it:
#  (?!CAT)   # first assert that it's not possible to match "CAT" here,
#  .         # then match any character
# )*         # end of group, zero or more repetitions.
# Raw strings avoid the invalid "\(" escape warning in recent Python versions.
p = ''.join([r"^", # Match start of string
             r"(.*?(?:(?!\().)*)", # Group 1: anything up to the first open paren, i.e. the first word (abadan or abandon)
             r"(.*?(?:(?!CAT).)*)", # Group 2: everything after group 1, up to but not including "CAT"
             r"(.*$)" # Group 3: the rest of the line
             ])

for line in data:
    m = re.match(p, line)    
    newline  = m.group(1) # First word
    newline += m.group(2) # Group two
    newline += m.group(1) + ";" # First word again with semi-colon
    newline += m.group(3) # Group three

    print(newline)

OUTPUT:

abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
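
Since the speed claim above is untested, one way to check it would be a small timeit comparison. This is only a sketch: the knit pattern is a simplified form of the one above, the replace pattern is borrowed from another answer in this thread, and the function names and repetition count are arbitrary choices. Timings depend on the machine and Python version, so none are shown.

```python
import re
import timeit

line = "abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"

# Approach 1: match three groups and stitch them back together.
knit_re = re.compile(r'^([^(]*)((?:(?!CAT).)*)(.*)$')

def knit(s):
    m = knit_re.match(s)
    return m.group(1) + m.group(2) + m.group(1) + ";" + m.group(3)

# Approach 2: a single regex replace.
sub_re = re.compile(r'((\w+).+?)(?=;CAT)')

def replace(s):
    return sub_re.sub(r'\1;\2', s)

# Both approaches must agree before their timings are worth comparing.
assert knit(line) == replace(line)

for fn in (knit, replace):
    seconds = timeit.timeit(lambda: fn(line), number=50_000)
    print(f"{fn.__name__}: {seconds:.3f} s for 50,000 calls")
```

For a one-off conversion of a single file, either approach is likely fast enough that readability matters more than the difference.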

This script reads the input file, does the replacement and writes to the output file:

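import re

infile = 'input.txt'
outfile = 'outfile.txt'

# Context managers close both files even if an error occurs.
with open(infile, 'r') as f, open(outfile, 'w') as o:
    for line in f:
        # Group 1 is the line up to ";CAT", group 2 is its first word;
        # re-emit group 1 followed by ";" and the first word.
        o.write(re.sub(r'((\w+).+?)(?=;CAT)', r'\1;\2', line))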

cat outfile.txt 
aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ; 
