Remove select characters from xml tags using regex
I am trying to remove only select characters from XML tags: the ns prefix, the digit that follows it, and the colon (:) after that.
For example: <ns2:projectArea alias=
should look like <projectArea alias=
and <ns9:name>
should look like <name>
Basically, the digit will be random (anything from 1-9) and it will always be followed by a colon (:) that must also be deleted.
What I have so far is:
import argparse
import re
# Initiates argument
parser = argparse.ArgumentParser()
parser.add_argument("--input", "-i", help="Set the input xml to clean up")
parser.add_argument("--output", "-o", help="Set the output xml location")
args = parser.parse_args()
inputfile = args.input
outputfile = args.output
if args.input:
    print("inputfile location is %s" % args.input)
if args.output:
    print("outputfile location is %s" % args.output)
# End argument
text = re.sub('<[^<]+>', "", open(inputfile).read())
with open(outputfile, "w") as f:
    f.write(text)
This piece of the code is the issue: '<[^<]+>'
It deletes entire tags, so if I need to search the text later on, I basically have to search plain text rather than search by tags.
What can I replace '<[^<]+>' with so that it deletes only ns + the following number (whatever number it may be) + the : that follows it?
It is probably happening because of the regular expression. Try using this one instead:
# No ^ anchor, so every tag in the file is matched; (/?) keeps the slash of closing tags
text = re.sub(r'<(/?)[a-zA-Z0-9]+:', r'<\1', open(inputfile).read())
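A minimal sketch of the same prefix-stripping idea, assuming the prefixes always have the form ns plus digits as in the question. The function name strip_ns_prefix and the sample string are made up for illustration. It captures an optional / so that closing tags such as </ns9:name> are handled as well:

```python
import re

# Matches "<" or "</" followed by a prefix such as "ns2:"
# (assumed form: "ns" + one or more digits + ":").
NS_PREFIX = re.compile(r'<(/?)ns\d+:')

def strip_ns_prefix(xml_text):
    """Remove nsN: prefixes from tag names and leave the rest of each tag intact."""
    # \1 puts back the "/" of a closing tag (or nothing for an opening tag).
    return NS_PREFIX.sub(r'<\1', xml_text)

# Hypothetical sample input, modeled on the tags in the question.
sample = '<ns2:projectArea alias="x"><ns9:name>demo</ns9:name></ns2:projectArea>'
print(strip_ns_prefix(sample))
```

This only touches the prefix directly after < or </, so xmlns:ns2="..." declarations and prefixed attributes, if the file has any, are left unchanged.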
This works:
Find:
r"<(?:(?:(/?)\w+[1-9]:(\w+\s*/?))|(?:\w+[1-9]:(\w+\s+(?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+\s*/?)))>"
Replace:
<$1$2$3>
https://regex101.com/r/yRhMI9/1
Readable version:
<
(?:
     (?:
          ( /? )                        # (1)
          \w+ [1-9] :
          ( \w+ \s* /? )                # (2)
     )
  |  (?:
          \w+ [1-9] :
          (                             # (3 start)
               \w+ \s+
               (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]? )+
               \s* /?
          )                             # (3 end)
     )
)
>
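As a rough sketch of how this pattern could be applied from Python: the readable version can be compiled with re.VERBOSE, which ignores the whitespace and the # comments. The replacement is written as \1\2\3 here because Python's re.sub does not use the $1 syntax. The name strip_prefixed_tags and the sample string are hypothetical, and Python 3.5 or later is assumed, since that is where groups that did not take part in the match are replaced with an empty string.

```python
import re

# The answer's pattern, compiled with re.VERBOSE so layout and comments are ignored.
TAG_WITH_PREFIX = re.compile(r"""
    <
    (?:
        (?:
            ( /? )                  # (1) optional slash of a closing tag
            \w+ [1-9] :             # prefix ending in a digit 1-9, then a colon
            ( \w+ \s* /? )          # (2) tag name without attributes
        )
      | (?:
            \w+ [1-9] :
            (                       # (3) tag name followed by attributes
                \w+ \s+
                (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]? )+
                \s* /?
            )
        )
    )
    >
""", re.VERBOSE)

def strip_prefixed_tags(xml_text):
    """Rewrite each prefixed tag as the same tag without its prefix."""
    # Only one alternative matches per tag, so the unused groups come back empty.
    return TAG_WITH_PREFIX.sub(r"<\1\2\3>", xml_text)

# Hypothetical sample input, modeled on the tags in the question.
sample = '<ns2:projectArea alias="x"><ns9:name>demo</ns9:name></ns2:projectArea>'
print(strip_prefixed_tags(sample))
```

Compared with a short prefix-only pattern, this one matches the whole tag, including quoted attribute values, so it is stricter about what counts as a tag.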