
How to replace a pattern using regex in Python?

I have a dataset that looks like this:

Male    Name=Tony;  
Female  Name=Alice.1; 
Female  Name=Alice.2;
Male    Name=Ben; 
Male    Name=Shankar; 
Male    Name=Bala; 
Female  Name=Nina; 
###
Female  Name=Alex.1; 
Female  Name=Alex.2;
Male    Name=James; 
Male    Name=Graham; 
Female  Name=Smith;  
###
Female  Name=Xing;
Female  Name=Flora;
Male    Name=Steve.1;
Male    Name=Steve.2; 
Female  Name=Zac;  
###

I want to change the list so it looks like this:

Male    Name=Class_1;
Female  Name=Class_1.1;
Female  Name=Class_1.2;
Male    Name=Class_1;
Male    Name=Class_1;
Male    Name=Class_1; 
Female  Name=Class_1;
###
Female  Name=Class_2.1; 
Female  Name=Class_2.2; 
Male    Name=Class_2; 
Male    Name=Class_2; 
Female  Name=Class_2;  
###
Female  Name=Class_3; 
Female  Name=Class_3; 
Male    Name=Class_3.1; 
Male    Name=Class_3.2; 
Female  Name=Class_3;
###

Each name has to be changed to the class it belongs to. I noticed that in the dataset, the end of each class is marked by a '###' line. So I can split the dataset into blocks at '###' and count the occurrences of '###', then use a regex to find the names and replace them with the prefix plus that count.

My code looks like this:

blocks = [b.strip() for b in open('/file', 'r').readlines()]
pattern = r'Name=(.*?)[;/]'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:
    match = re.findall(pattern, line)
    print match

for line in blocks:
    if line == '###':
        triple_hash_count += 1
        print line 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count))) 

This doesn't seem to do the job: no replacements are made.

When running the code you provided, I got the following traceback output:

print(line.replace(match, prefix + str(triple_hash_count))) 
TypeError: Can't convert 'list' object to str implicitly

The error happens because match is a list, not a string: type(match) evaluates to list. When I inspect this list in PDB, it is empty. That is a consequence of having two for-loops: by the time the second loop runs, match only holds the result for the last line the first loop processed (a ### line in your sample, which contains no Name= field). So let's combine them as such:
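To see why match ends up empty, here is a small sketch using a few lines in the style of the sample data (the sample_lines list is made up for illustration). A name assigned inside a for loop is still defined after the loop ends; it simply holds the value from the last iteration.

```python
import re

pattern = r'Name=(.*?)[;/]'
# Hypothetical stand-in for the stripped lines of the input file.
sample_lines = ['Male    Name=Tony;', 'Female  Name=Alice.1;', '###']

results = []
for line in sample_lines:
    match = re.findall(pattern, line)
    results.append(match)

# `match` is still defined here; it holds the result for the last line, '###'.
print(results)  # [['Tony'], ['Alice.1'], []]
print(match)    # []
```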

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count)))

Now you're getting content in match, but there's still a problem: re.findall returns a list of strings, while str.replace(...) expects a single string as its first argument.

You could cheat and change the offending line to print(line.replace(match[0], prefix + str(triple_hash_count))), but that presumes you are sure to find a regular expression match on every line that isn't ###. A more resilient approach is to check that you have a match before calling str.replace() with it.
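The type mismatch and the guard can be tried out on a single line. This is only an illustrative sketch: the sample line and the outputs list are made up here, and the pattern is the one from the question.

```python
import re

pattern = r'Name=(.*?)[;/]'
line = 'Female  Name=Alice.1;'

match = re.findall(pattern, line)
print(type(match).__name__, match)  # list ['Alice.1']

try:
    line.replace(match, 'Class_1')  # passing the whole list fails
except TypeError as exc:
    print('TypeError:', exc)

# Guard against lines with no match before indexing into the list.
outputs = []
for candidate in (line, '###'):
    found = re.findall(pattern, candidate)
    outputs.append(candidate.replace(found[0], 'Class_1') if found else candidate)

print(outputs)  # ['Female  Name=Class_1;', '###']
```

Note that with this pattern the captured text is 'Alice.1', so the '.1' suffix is replaced along with the name. A pattern that stops before the dot would be needed to keep the suffix.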

The final code looks like this:

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else:
        if match: 
            print(line.replace(match[0], prefix + str(triple_hash_count)))
        else:
            print(line)

Two more things:

  1. On line 11, the variable name was wrong. It should be triple_hash_count, not hash_count.
  2. This code won't actually change the text file provided as input on line 1. You need to write the result of line.replace(match[0], prefix + str(triple_hash_count)) back to a file, not just print it.
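For the second point, one possible way to persist the result is to write each rewritten line to a new file. This is only a sketch under assumptions: the helper name rewrite_file, the NAME_RX pattern, and any paths passed to it are hypothetical and not part of the original code. The pattern stops before the optional .N suffix so that the suffix survives.

```python
import re

# "Name=" followed by the name, stopping before an optional ".N" suffix or the ";".
NAME_RX = re.compile(r'(Name=)[^.;]+')

def rewrite_file(src_path, dst_path, prefix='Class_'):
    """Copy src_path to dst_path, replacing each name with its class label."""
    count = 1
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for raw in src:
            line = raw.strip()
            if line == '###':
                count += 1  # lines after a '###' belong to the next class
            else:
                # \g<1> keeps the "Name=" part; lines without a name pass through
                line = NAME_RX.sub(r'\g<1>' + prefix + str(count), line)
            dst.write(line + '\n')
```

Calling rewrite_file('input.txt', 'output.txt') (placeholder paths) writes to a separate file, which avoids overwriting the input while it is still being read.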

The problem is rooted in the use of a second loop (as well as a misnamed variable). This will work:

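import re

# Read the file and strip surrounding whitespace from every line.
with open('/file', 'r') as f:
    blocks = [b.strip() for b in f]

# Capture the name up to (but excluding) an optional ".N" suffix or the ";".
pattern = r'Name=([^.\d;]*)'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:
    if line == '###':
        # A '###' line closes the current class.
        triple_hash_count += 1
        print(line)
    else:
        match = re.findall(pattern, line)
        print(line.replace(match[0], prefix + str(triple_hash_count)))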

While you already have your answer, you can do it in just a couple of lines with regular expressions (it could even be a one-liner, but that is not very readable):

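import re

# `string` is assumed to hold the full text of the input file.
hashrx = re.compile(r'^###$', re.MULTILINE)
namerx = re.compile(r'Name=\w+(\.\d+)?;')

# The trailing ";" is re-added because the pattern consumes it.
new_string = '###'.join([namerx.sub(r"Name=Class_{}\1;".format(idx + 1), part)
                for idx, part in enumerate(hashrx.split(string))])
print(new_string)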

What it does:

  1. First, it looks for ### on a line of its own, using the anchors ^ and $ in MULTILINE mode.
  2. Second, it looks for a possible number after the name, capturing it into group 1 (the group is optional, as not all of your names have a number).
  3. Third, it splits your string at ### and iterates over the parts with enumerate(), which provides a counter for the class numbers to be inserted.
  4. Lastly, it joins the resulting list with ### again.
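The four steps can be run end to end on a shortened sample. This is a sketch under assumptions: the text variable is a made-up excerpt in the layout of the question's data, and it relies on Python 3.5 or later, where a backreference to an unmatched group is replaced with an empty string. The replacement also re-appends the ";" that the pattern consumes, so the output keeps its terminator.

```python
import re

# Shortened sample in the same layout as the question's data.
text = """Male    Name=Tony;
Female  Name=Alice.1;
###
Female  Name=Alex.1;
Male    Name=James;
###
"""

hashrx = re.compile(r'^###$', re.MULTILINE)   # step 1: '###' on its own line
namerx = re.compile(r'Name=\w+(\.\d+)?;')     # step 2: optional ".N" in group 1

parts = hashrx.split(text)  # step 3: one chunk per class, plus a trailing chunk
renamed = [
    # \1 re-inserts the optional ".N" suffix; the ";" is added back by hand
    namerx.sub(r'Name=Class_{}\1;'.format(idx + 1), part)
    for idx, part in enumerate(parts)
]
new_text = '###'.join(renamed)  # step 4: restore the separators
print(new_text)
```

Because only the ### markers are removed by the split, the newlines around them stay inside the chunks and the join restores the original line layout.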

As a one-liner (though not advisable):

new_string = '###'.join(
                [re.sub(r'Name=\w+(\.\d+)?;', r"Name=Class_{}\1;".format(idx + 1), part)
                for idx, part in enumerate(re.split(r'^###$', string, flags=re.MULTILINE))])

