How to replace a pattern using regex in Python?
I have a dataset that looks like this:
Male Name=Tony;
Female Name=Alice.1;
Female Name=Alice.2;
Male Name=Ben;
Male Name=Shankar;
Male Name=Bala;
Female Name=Nina;
###
Female Name=Alex.1;
Female Name=Alex.2;
Male Name=James;
Male Name=Graham;
Female Name=Smith;
###
Female Name=Xing;
Female Name=Flora;
Male Name=Steve.1;
Male Name=Steve.2;
Female Name=Zac;
###
I want to change the list so it looks like this:
Male Name=Class_1;
Female Name=Class_1.1;
Female Name=Class_1.2;
Male Name=Class_1;
Male Name=Class_1;
Male Name=Class_1;
Female Name=Class_1;
###
Female Name=Class_2.1;
Female Name=Class_2.2;
Male Name=Class_2;
Male Name=Class_2;
Female Name=Class_2;
###
Female Name=Class_3;
Female Name=Class_3;
Male Name=Class_3.1;
Male Name=Class_3.2;
Female Name=Class_3;
###
Each name has to be changed to the class it belongs to. I noticed that in the dataset, each class in the list is terminated by a '###' line. So I can split the dataset into blocks by '###' and count the occurrences of '###', then use a regex to find the names and replace each one with the prefix plus the current count.
My code looks like this:
import re
blocks = [b.strip() for b in open('/file', 'r').readlines()]
pattern = r'Name=(.*?)[;/]'
prefix = 'Class_'
triple_hash_count = 1
for line in blocks:
    match = re.findall(pattern, line)
    print(match)
for line in blocks:
    if line == '###':
        triple_hash_count += 1
        print(line)
    else:
        print(line.replace(match, prefix + str(triple_hash_count)))
This doesn't seem to do the job - no replacements are made.
When running the code you provided, I got the following traceback output:
    print(line.replace(match, prefix + str(triple_hash_count)))
TypeError: Can't convert 'list' object to str implicitly
The error happens because type(match) evaluates to a list. When I inspect this list in PDB, it's an empty list. This is because the first loop reassigns match on every iteration, so by the time the second loop runs, match only holds the result for the last line of the file, which is '###' and contains no name. So let's combine the two loops:
for line in blocks:
    match = re.findall(pattern, line)
    print(match)
    if line == '###':
        triple_hash_count += 1
        print(line)
    else:
        print(line.replace(match, prefix + str(triple_hash_count)))
Now you're getting content in match, but there's still a problem: the return type of re.findall is a list of strings, while str.replace(...) expects a single string as its first argument.
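To make the type mismatch concrete, here is a small sketch that uses the question's pattern on one sample line from the dataset. The variable names `matches`, `error` and `replaced` are only illustrative.

```python
import re

pattern = r'Name=(.*?)[;/]'
line = 'Female Name=Alice.1;'

# re.findall returns a list of the captured groups, even for a single hit.
matches = re.findall(pattern, line)

# Passing that list straight to str.replace raises TypeError on Python 3.
error = None
try:
    line.replace(matches, 'Class_1')
except TypeError as exc:
    error = exc

# Indexing the list gives the string that str.replace needs.
# Note that this pattern captures 'Alice.1', so the '.1' suffix is replaced too.
replaced = line.replace(matches[0], 'Class_1')
print(matches, type(error).__name__, replaced)
```

A line such as '###' yields an empty list, which is why indexing with [0] is only safe after checking that the list is non-empty.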
You could cheat, and change the offending line to print(line.replace(match[0], prefix + str(triple_hash_count))), but that presumes you're sure you'll find a regular expression match on every line that isn't '###'. A more resilient way is to check that you have a match before you call str.replace() with it.
The final code looks like this:
for line in blocks:
    match = re.findall(pattern, line)
    print(match)
    if line == '###':
        triple_hash_count += 1
        print(line)
    else:
        if match:
            print(line.replace(match[0], prefix + str(triple_hash_count)))
        else:
            print(line)
Two more things:
The variable should be triple_hash_count, not hash_count.
Write the result of line.replace(match, prefix + str(triple_hash_count)) back to the file, not just print it.
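As a rough sketch of writing the result back instead of printing it, the loop can be wrapped in a function that returns the new lines, plus a helper that saves them. The names `relabel_lines`, `relabel_file`, `NAME_RX` and both path parameters are hypothetical. The pattern is also an assumption: it stops at the first '.' or ';', so suffixes such as '.1' survive, unlike with the question's pattern.

```python
import re

# Assumed pattern: capture "Name=" and match the name only, stopping before
# "." or ";" so that a suffix such as ".1" is left untouched.
NAME_RX = re.compile(r'(Name=)[^.;]+')

def relabel_lines(lines, prefix='Class_'):
    """Return a new list of lines with each name replaced by its class label."""
    class_no = 1
    out = []
    for line in lines:
        if line.strip() == '###':
            class_no += 1      # the following block belongs to the next class
            out.append(line)
        else:
            # Lines without "Name=..." pass through unchanged.
            out.append(NAME_RX.sub(r'\g<1>' + prefix + str(class_no), line, count=1))
    return out

def relabel_file(src_path, dst_path):
    """Read src_path, relabel its lines, and write them to dst_path."""
    with open(src_path) as src:
        lines = src.read().splitlines()
    with open(dst_path, 'w') as dst:
        dst.write('\n'.join(relabel_lines(lines)) + '\n')
```

Writing to a separate output path rather than overwriting the input is a deliberate choice here, so the original data stays available if the pattern turns out to be wrong.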
The problem is rooted in the use of a second loop (as well as a mis-named variable). This will work.
import re
blocks = [b.strip() for b in open('/file', 'r').readlines()]
# Capture the name only: stop at the first dot, digit or semicolon
pattern = r'Name=([^\.\d;]*)'
prefix = 'Class_'
triple_hash_count = 1
for line in blocks:
    if line == '###':
        # A '###' line closes the current class
        triple_hash_count += 1
        print(line)
    else:
        match = re.findall(pattern, line)
        print(line.replace(match[0], prefix + str(triple_hash_count)))
While you already have your answer, you can do it in just a couple of lines with regular expressions (it could even be a one-liner, but that is not very readable):
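import re
# string holds the whole dataset as one string
hashrx = re.compile(r'^###$', re.MULTILINE)
namerx = re.compile(r'Name=\w+(\.\d+)?;')
# The pattern consumes the trailing ';', so the replacement puts it back
new_string = '###'.join([namerx.sub(r"Name=Class_{}\1;".format(idx + 1), part)
                         for idx, part in enumerate(hashrx.split(string))])
print(new_string)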
What this does:
It looks for ### on a single line, using the anchors ^ and $ in MULTILINE mode.
It looks for a possible number after the Name, capturing it into group 1 (made optional, as not all of your names have one).
It splits your string by ### and iterates over the parts with enumerate(), thus having a counter for the numbers to be inserted.
Lastly, it joins the resulting list by ### again.
As a one-liner:
new_string = '###'.join(
    [re.sub(r'Name=\w+(\.\d+)?;', r"Name=Class_{}\1;".format(idx + 1), part)
     for idx, part in enumerate(re.split(r'^###$', string, flags=re.MULTILINE))])
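To try the split-and-join approach end to end, the sketch below wraps it in a function and runs it on a shortened sample of the dataset. The function name `relabel` and the `sample` text are illustrative only. It assumes Python 3.5 or later, where a backreference to an optional group that did not match is replaced by an empty string; older versions raise an error instead.

```python
import re

def relabel(text):
    """Split on '###' lines, relabel each block, and join the blocks again."""
    parts = re.split(r'^###$', text, flags=re.MULTILINE)
    return '###'.join(
        # The pattern consumes the trailing ';', so the replacement restores it.
        re.sub(r'Name=\w+(\.\d+)?;', r'Name=Class_{}\1;'.format(idx + 1), part)
        for idx, part in enumerate(parts)
    )

sample = (
    "Male Name=Tony;\n"
    "Female Name=Alice.1;\n"
    "###\n"
    "Female Name=Alex.2;\n"
    "###\n"
)
print(relabel(sample))
```

Because the text is split on the separator lines themselves, the block index from enumerate() doubles as the class counter, and no explicit counting of '###' is needed.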