
Using Python to substitute or swap substrings in a file

Suppose I have a line in an ASCII file of the following form:

{text1} {stringA} {text2} {stringB} {text3}

where {stringA} and {stringB} are substrings of interest. Let's call them "A" and "B" respectively. The strings {text1}, {text2}, and {text3} are strings of any length (possibly empty) that do not contain either A or B.

What I want to do in Python is simply swap A and B such that the line goes from

{text1} {stringA} {text2} {stringB} {text3}

to

{text1} {stringB} {text2} {stringA} {text3}

I'd appreciate any help here. I think that getting help on this question will also help me learn to work better with regular expressions in Python.

Note that {text1}, {text2}, and {text3} are unknown strings.

We know the substrings A and B exactly, and we know that A precedes B in the line. However, we don't know what (if anything) comes before, between, or after them.

Examples (A=John, B=Tim):

(1) This:

"I told John to give the bag to Tim."

is changed to this:

"I told Tim to give the bag to John."

(2) This:

"John said hello to Tim."

is changed to this:

"Tim said hello to John."

(3) This:

"John!h9aghagTim"

is changed to this:

"Tim!h9aghagJohn"

>>> import re
>>> text = '{text1} {stringA} {text2} {stringB} {text3}'
>>> re.sub(r'(stringA)(.*)(stringB)', r'\3\2\1', text)
'{text1} {stringB} {text2} {stringA} {text3}'

Replace stringA and stringB with your substrings of interest. Note that you may want to re.escape() them in case the substrings contain characters with a special meaning in regex.
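As a minimal sketch of the re.escape() advice, the swap can be wrapped in a small helper. The function name `swap_substrings` and the sample line are made up for illustration; only `re.escape`, `re.compile`, and `sub` come from the standard library.

```python
import re

def swap_substrings(line, a, b):
    """Swap `a` with a later `b` in `line` (hypothetical helper; assumes a precedes b)."""
    # re.escape() makes regex metacharacters in a and b match literally.
    pattern = re.compile(r'(%s)(.*)(%s)' % (re.escape(a), re.escape(b)))
    # \3\2\1 writes the groups back in the order B, middle text, A.
    return pattern.sub(r'\3\2\1', line)

# "$" and "+" are regex metacharacters, so unescaped they would not match literally.
print(swap_substrings("price is $5 (net), was 1+1 (gross)", "$5", "1+1"))
# price is 1+1 (net), was $5 (gross)
```

Because `.*` is greedy, a line containing B more than once would be swapped against the last B. The question rules that case out, so the sketch does not handle it.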

Test cases:

>>> stringA = 'John'
>>> stringB = 'Tim'
>>> regex = re.compile(r'(%s)(.*)(%s)' % (stringA, stringB))
>>> regex.sub(r'\3\2\1', "I told John to give the bag to Tim.")
'I told Tim to give the bag to John.'
>>> regex.sub(r'\3\2\1', "John said hello to Tim.")
'Tim said hello to John.'
>>> regex.sub(r'\3\2\1', "John!h9aghagTim")
'Tim!h9aghagJohn'
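Since the question is about lines in a file, here is one possible way to apply the same substitution line by line. This is only a sketch: the function name `swap_in_file` is hypothetical, the file is assumed to be small enough to read into memory, and the demo writes to a throwaway temporary file.

```python
import os
import re
import tempfile

def swap_in_file(path, a, b):
    """Rewrite the file at `path`, swapping a and b on each line where a precedes b."""
    pattern = re.compile(r'(%s)(.*)(%s)' % (re.escape(a), re.escape(b)))
    with open(path, encoding='ascii') as f:
        lines = f.readlines()
    with open(path, 'w', encoding='ascii') as f:
        for line in lines:
            # Lines that do not contain A followed by B are written back unchanged.
            f.write(pattern.sub(r'\3\2\1', line))

# Demo on a temporary file (contents invented for illustration).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='ascii') as tmp:
    tmp.write("John said hello to Tim.\nNo names here.\n")
swap_in_file(tmp.name, 'John', 'Tim')
with open(tmp.name, encoding='ascii') as f:
    result = f.read()
os.remove(tmp.name)
print(result)
```

Processing one line at a time matters here because `.` does not match a newline by default, so each match stays within a single line.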

The approach to go for is to use capturing groups so that you can refer to them later:

result = re.sub(r"(\{text1\}) (\{stringA\}) (\{text2\}) (\{stringB\}) (\{text3\})", r"\1 \4 \3 \2 \5", subject)

A capture group is delimited by parentheses (), and in Python you refer to it in the replacement string as \x, where x is the number of the capture group.
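To make the group references concrete, the sketch below shows three equivalent ways of writing the replacement in Python's `re` module: plain numbered references, the `\g<n>` form, and named groups. The group names `a`, `mid`, and `b` are arbitrary choices for this example.

```python
import re

line = "John said hello to Tim."

# Numbered backreferences: \1, \2, \3 refer to the groups left to right.
by_number = re.sub(r'(John)(.*)(Tim)', r'\3\2\1', line)

# \g<n> is the same thing, but stays unambiguous if a digit follows the reference.
by_g_number = re.sub(r'(John)(.*)(Tim)', r'\g<3>\g<2>\g<1>', line)

# Named groups document which part is which.
by_name = re.sub(r'(?P<a>John)(?P<mid>.*)(?P<b>Tim)', r'\g<b>\g<mid>\g<a>', line)

print(by_number)    # Tim said hello to John.
print(by_g_number)  # Tim said hello to John.
print(by_name)      # Tim said hello to John.
```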

Update 1

Your examples make it more obvious what you want and how you currently think about regexes. Regexes match patterns of characters. You want to swap names (Tom, Tim, ...), so we need to come up with a pattern that matches a name, which is only possible by complete enumeration. In my language there are (I think) thousands of first names, and some of them are also used to refer to objects rather than persons. To make that distinction you have to take context into account, which a regex cannot do. Let me know if this makes sense, because it's important if you want to go any further.

Update 2

I suspect your question is asked out of curiosity and not to solve a real-life problem. But if we go along with it, then the following would get you far, although it's not perfect and cannot be.

regex

(.*)\b(John|Tim|Jo)\b(.*)\b(John|Tim|Jo)\b

replace with

\1\4\3\2

In Python

result = re.sub(r"(?sm)(.*)\b(John|Tim|Jo)\b(.*)\b(John|Tim|Jo)\b", r"\1\4\3\2", subject)

Note the \b in the regex, which states that the match should happen at word boundaries. This prevents matches inside longer words such as Johndoe.
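A short check of what \b does and does not allow, using strings taken from this thread. One consequence worth noticing: in example (3) from the question, "Tim" directly follows the letter "g", so there is no word boundary in front of it and a \b-based pattern will not find it there.

```python
import re

john = re.compile(r'\bJohn\b')

# A standalone word has a boundary on both sides.
standalone = bool(john.search("John said hello to Tim."))

# Inside "Johndoe" there is no boundary after "John".
inside_word = bool(john.search("Johndoe said hello"))

# In "John!h9aghagTim", "Tim" is glued to "h9aghag", so \bTim\b does not match.
glued = bool(re.search(r'\bTim\b', "John!h9aghagTim"))

print(standalone, inside_word, glued)  # True False False
```

Whether word boundaries are wanted therefore depends on the input: they help with prose, but they rule out the exact-substring case from example (3).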

Also observe that the regex above will fail for the sentence

Tim bought some top level domains of Jordan that end with Jo from John
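Running the regex on that sentence shows what goes wrong. Because the leading `(.*)` is greedy, it swallows as much as possible, so the two names that end up in groups 2 and 4 are the last two candidates in the line ("Jo" and "John"), and "Tim" is left where it was. The variable names below are illustrative.

```python
import re

pattern = r"(?sm)(.*)\b(John|Tim|Jo)\b(.*)\b(John|Tim|Jo)\b"
subject = "Tim bought some top level domains of Jordan that end with Jo from John"

swapped = re.sub(pattern, r"\1\4\3\2", subject)
print(swapped)
# Tim bought some top level domains of Jordan that end with John from Jo
```

"Jordan" is correctly skipped thanks to \b, but the swap is applied to "Jo" and "John" rather than to "Tim" and "John", which is presumably not what a reader would expect.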
