How do I replace a pattern using regular expressions in Python?

Question

I have a dataset that looks like this:

Male    Name=Tony;  
Female  Name=Alice.1; 
Female  Name=Alice.2;
Male    Name=Ben; 
Male    Name=Shankar; 
Male    Name=Bala; 
Female  Name=Nina; 
###
Female  Name=Alex.1; 
Female  Name=Alex.2;
Male    Name=James; 
Male    Name=Graham; 
Female  Name=Smith;  
###
Female  Name=Xing;
Female  Name=Flora;
Male    Name=Steve.1;
Male    Name=Steve.2; 
Female  Name=Zac;  
###

I want to change the list so it looks like this:

Male    Name=Class_1;
Female  Name=Class_1.1;
Female  Name=Class_1.2;
Male    Name=Class_1;
Male    Name=Class_1;
Male    Name=Class_1; 
Female  Name=Class_1;
###
Female  Name=Class_2.1; 
Female  Name=Class_2.2; 
Male    Name=Class_2; 
Male    Name=Class_2; 
Female  Name=Class_2;  
###
Female  Name=Class_3; 
Female  Name=Class_3; 
Male    Name=Class_3.1; 
Male    Name=Class_3.2; 
Female  Name=Class_3;
###

Each name has to be changed to the class it belongs to. I noticed that in the dataset, the end of each class is marked by a '###' line. So I can split the dataset into blocks at '###' and count the occurrences of '###', then use a regex to find the names and replace each one with 'Class_' plus that count.

My code looks like this:

blocks = [b.strip() for b in open('/file', 'r').readlines()]
pattern = r'Name=(.*?)[;/]'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:
    match = re.findall(pattern, line)
    print match

for line in blocks:
    if line == '###':
        triple_hash_count += 1
        print line 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count)))

This doesn't seem to do the job: no replacements are made.

Answer 1

When running the code you provided, I got the following traceback output:

print(line.replace(match, prefix + str(triple_hash_count))) 
TypeError: Can't convert 'list' object to str implicitly

The error happens because type(match) evaluates to a list. When I inspect this list in PDB, it's an empty list. This is because the code uses two separate for-loops: by the time the second loop runs, match holds only the result from the last iteration of the first loop, and the last line of the file contains no name. So let's combine them as such:
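A minimal sketch of why the list is empty, using a three-line sample in place of the real file (the sample lines are an assumption for illustration). After the first loop finishes, match keeps only the value from its final iteration:

```python
import re

pattern = r'Name=(.*?)[;/]'
# Hypothetical sample standing in for the stripped lines of the file.
blocks = ['Male    Name=Tony;', 'Female  Name=Nina;', '###']

for line in blocks:
    match = re.findall(pattern, line)

# The variable survives the loop, but holds only the last result.
# The last line is '###', which contains no 'Name=...;', so the list is empty.
print(match)
```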

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count)))

Now you're getting content in match, but there's still a problem: re.findall returns a list of strings, whereas str.replace(...) expects a single string as its first argument.
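The type mismatch can be reproduced on a single line. This is only a sketch with a hypothetical sample line; the exact wording of the error message depends on the Python version:

```python
import re

line = 'Female  Name=Alice.1;'   # hypothetical sample line
match = re.findall(r'Name=(.*?)[;/]', line)
print(type(match).__name__, match)   # a list, even when there is only one hit

error = None
try:
    # str.replace() needs a str as its first argument, not a list.
    line.replace(match, 'Class_1')
except TypeError as exc:
    error = exc
    print('TypeError:', exc)
```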

You could cheat and change the offending line to print(line.replace(match[0], prefix + str(triple_hash_count))) -- but that presumes you're sure to find a regular expression match on every line that isn't ###. A more resilient way is to check that you have a match before you try to call str.replace() with it.

The final code looks like this:

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else:
        if match: 
            print(line.replace(match[0], prefix + str(triple_hash_count)))
        else:
            print(line)
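One caveat, assuming the desired output shown in the question is the goal: with the question's pattern, the captured group includes the .1/.2 suffix, so str.replace() overwrites it. The sketch below compares that pattern with a narrower one that stops at the first '.', digit, or ';'. The sample line and the variable names whole and base are illustrative only:

```python
import re

line = 'Female  Name=Alice.1;'   # hypothetical sample line
label = 'Class_1'

# Pattern from the question: the group captures 'Alice.1', suffix included.
whole = re.findall(r'Name=(.*?)[;/]', line)[0]
lost_suffix = line.replace(whole, label)
print(lost_suffix)    # the '.1' is replaced along with the name

# Narrower pattern: only the base name 'Alice' is captured.
base = re.findall(r'Name=([^.\d;]*)', line)[0]
kept_suffix = line.replace(base, label)
print(kept_suffix)    # the '.1' stays in place
```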

Two more things:

On line 11, you used the wrong variable name. It's triple_hash_count, not hash_count.
This code won't actually change the text file provided as input on line 1. You need to write the result of line.replace(match[0], prefix + str(triple_hash_count)) back to the file, not just print it.
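Writing the result back could look roughly like the following. This is a sketch, not a definitive implementation: the function name rewrite_file is hypothetical, it reuses the question's pattern and the match check from the final code above, and it overwrites the input file in place. Like that code, it also overwrites any .1/.2 suffix, and strip() drops trailing whitespace from each line.

```python
import re

def rewrite_file(path, prefix='Class_'):
    """Replace each name in the file at `path` with its class label, in place."""
    pattern = r'Name=(.*?)[;/]'
    count = 1
    new_lines = []
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if line == '###':
                # A separator line starts the next class.
                count += 1
            else:
                match = re.findall(pattern, line)
                if match:
                    line = line.replace(match[0], prefix + str(count))
            new_lines.append(line)
    # Reopen the same path for writing only after reading has finished.
    with open(path, 'w') as f:
        f.write('\n'.join(new_lines) + '\n')
```

Writing to a separate output path first is the safer choice if the original data must be preserved.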

Answer 2

The problem is rooted in the use of a second loop (as well as a misnamed variable). This will work:

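import re

# Read the file and strip surrounding whitespace from every line.
with open('/file', 'r') as f:
    blocks = [b.strip() for b in f]

# Capture only the base name, stopping before any '.', digit, or ';'.
pattern = r'Name=([^\.\d;]*)'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:
    if line == '###':
        # A separator line starts the next class.
        triple_hash_count += 1
        print(line)
    else:
        match = re.findall(pattern, line)
        # Lines without a name (e.g. blank lines) are printed unchanged.
        print(line.replace(match[0], prefix + str(triple_hash_count)) if match else line)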

Answer 3

While you already have your answer, you can do it in just a couple of lines with regular expressions (it could even be a one-liner, but that is not very readable):

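import re

# Read the whole file into one string (path taken from the question).
with open('/file') as f:
    string = f.read()

hashrx = re.compile(r'^###$', re.MULTILINE)
# The pattern consumes the trailing ';', so the replacement puts it back.
namerx = re.compile(r'Name=\w+(\.\d+)?;')

new_string = '###'.join([namerx.sub(r"Name=Class_{}\1;".format(idx + 1), part)
                for idx, part in enumerate(hashrx.split(string))])
print(new_string)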

What it does:

First, it looks for ### on a line of its own, using the anchors ^ and $ in MULTILINE mode.
Second, it looks for a possible number after the name, capturing it into group 1 (made optional because not all of your names have one).
Third, it splits your string at ### and iterates over the parts with enumerate(), which provides a counter for the numbers to be inserted.
Lastly, it joins the resulting list with ### again.
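The four steps above can also be written out as an explicit loop. This is a sketch under stated assumptions: the function name relabel_blocks and the sample text are hypothetical. It uses a replacement function rather than a \1 template, which avoids depending on how an unmatched optional group is substituted (Python versions before 3.5 raise an error in that case) and puts the semicolon consumed by the pattern back into the result:

```python
import re

hash_rx = re.compile(r'^###$', re.MULTILINE)   # step 1: a line that is exactly ###
name_rx = re.compile(r'Name=\w+(\.\d+)?;')     # step 2: optional suffix in group 1

def relabel_blocks(text):
    """Replace every name with Class_<n>, where n is the 1-based block number."""
    relabeled = []
    # Step 3: split at the ### lines and number the parts starting from 1.
    for idx, part in enumerate(hash_rx.split(text), start=1):
        def repl(m, idx=idx):
            # m.group(1) is None when the name has no '.N' suffix.
            return 'Name=Class_{}{};'.format(idx, m.group(1) or '')
        relabeled.append(name_rx.sub(repl, part))
    # Step 4: put the separators back.
    return '###'.join(relabeled)

# Hypothetical sample in the same layout as the question's data.
sample = 'Male    Name=Tony;\nFemale  Name=Alice.1;\n###\nFemale  Name=Alex.2;\n###\n'
print(relabel_blocks(sample))
```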

As a one-liner (though not advisable):

new_string = '###'.join(
                [re.sub(r'Name=\w+(\.\d+)?;', r"Name=Class_{}\1;".format(idx + 1), part)
                for idx, part in enumerate(re.split(r'^###$', string, flags=re.MULTILINE))])

Demo

A demo says more than a thousand words.
