Removing select characters from XML tags using a regular expression

Question

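I am trying to remove only select characters from XML tags: ns + whatever number follows it + the trailing colon. For example, <ns2:projectArea alias= should end up as <projectArea alias= and <ns9:name> should end up as <name>.

Basically, the number will be random (any digit from 1 to 9), and it will always be followed by a : that must be removed as well.

So far I have: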

import argparse
import re

# Initiates argument
parser = argparse.ArgumentParser()

parser.add_argument("--input", "-i", help="Set the input xml to clean up")
parser.add_argument("--output", "-o", help="Set the output xml location")

args = parser.parse_args()
inputfile = args.input
outputfile = args.output
if args.input:
  print("inputfile location is %s" % args.input)
if args.output:
  print("outputfile location is %s" % args.output)
# End argument

text = re.sub('<[^<]+>', "", open(inputfile).read())
with open(outputfile, "w") as f:
    f.write(text)

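This part of the code is the problem: '<[^<]+>' removes the entire tag, so if I need to search the text later, I am basically searching plain text rather than tags.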
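To illustrate the behaviour described above, here is a minimal sketch. The `sample` string is a made-up input built from the tag names in the question, not the actual file.

```python
import re

# Hypothetical sample input, assembled from the tag names mentioned in the question.
sample = '<ns2:projectArea alias="demo"><ns9:name>Example</ns9:name></ns2:projectArea>'

# '<[^<]+>' matches each complete tag, so substituting "" strips all markup
# and leaves only the text content.
stripped = re.sub('<[^<]+>', "", sample)
print(stripped)
```

Only the text between the tags survives, which is why the tag names can no longer be searched afterwards.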

What can I replace '<[^<]+>' with so that it removes only ns + the number that follows (it could be any number) + the trailing : ?

Answer 1

That happens because of the regular expression you are using. Try the following one instead:

   text = re.sub('^<[a-zA-Z0-9]+:','<',open(inputfile).read())
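One caveat with the expression above: without `re.MULTILINE`, `^` matches only at the very start of the string, so only a tag at the beginning of the file is rewritten, and closing tags such as `</ns2:projectArea>` are not covered. The sketch below shows that effect and one possible variant. It assumes prefixes always have the form `ns` plus digits; `strip_ns_prefix`, `NS_PREFIX`, and the sample string are hypothetical names introduced for illustration.

```python
import re

# Hypothetical sample input, assembled from the tag names mentioned in the question.
sample = '<ns2:projectArea alias="demo"><ns9:name>Example</ns9:name></ns2:projectArea>'

# The anchored pattern rewrites only the tag at the start of the string.
first_only = re.sub('^<[a-zA-Z0-9]+:', '<', sample)

# Assumption: prefixes always look like "ns" followed by one or more digits.
NS_PREFIX = re.compile(r'<(/?)ns[0-9]+:')

def strip_ns_prefix(xml_text):
    """Drop the nsN: prefix from opening and closing tag names."""
    # Group 1 keeps the optional "/" of a closing tag.
    return NS_PREFIX.sub(r'<\1', xml_text)

cleaned = strip_ns_prefix(sample)
print(first_only)
print(cleaned)
```

This variant leaves attributes and any `xmlns:` declarations untouched; whether that is acceptable depends on the input file.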

Answer 2

This works:

Find: r"<(?:(?:(/?)\w+[1-9]:(\w+\s*/?))|(?:\w+[1-9]:(\w+\s+(?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+\s*/?)))>"
Replace: <$1$2$3>

https://regex101.com/r/yRhMI9/1

Readable version:

 <
 (?:
      (?:
           ( /? )                        # (1)
           \w+ [1-9] :
           ( \w+ \s* /? )                # (2)
      )
   |  (?:
           \w+ [1-9] :
           (                             # (3 start)
                \w+ \s+ 
                (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]? )+
                \s* /?
           )                             # (3 end)
      )
 )
 >
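The find/replace pair above is written in the `$1` style used by regex101. A minimal sketch of how it might be applied with Python's `re.sub` follows, assuming Python 3.5 or later (where groups that did not participate in the match are replaced with an empty string). The sample string and variable names are hypothetical.

```python
import re

# The pattern from this answer, written as a raw triple-quoted string so that
# the double and single quotes inside it need no escaping.
TAG = re.compile(
    r"""<(?:(?:(/?)\w+[1-9]:(\w+\s*/?))|(?:\w+[1-9]:(\w+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)))>"""
)

# Hypothetical sample input, assembled from the tag names mentioned in the question.
sample = '<ns2:projectArea alias="demo"><ns9:name>Example</ns9:name></ns2:projectArea>'

# Python spells the backreferences \1 \2 \3 rather than $1 $2 $3.
cleaned = TAG.sub(r'<\1\2\3>', sample)
print(cleaned)
```

Note that `\w+[1-9]:` accepts any prefix ending in a digit from 1 to 9, not only `ns`, so the pattern is somewhat broader than the question requires.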

Answer 3

Regex: (?:(?<=<)|(?<=<\/))(ns[0-9]+:)(?=[^>]*?>)

Demo
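As a rough sketch of how this expression could be used from Python: because the lookbehinds consume nothing, the match is the prefix itself, and replacing it with an empty string is enough. The sample string and the `PREFIX` name are assumptions made for illustration.

```python
import re

# Lookbehinds require "<" or "</" before the prefix; the lookahead requires
# that a ">" follows later, so the prefix sits inside a tag.
PREFIX = re.compile(r'(?:(?<=<)|(?<=</))(ns[0-9]+:)(?=[^>]*?>)')

# Hypothetical sample input, assembled from the tag names mentioned in the question.
sample = '<ns2:projectArea alias="demo"><ns9:name>Example</ns9:name></ns2:projectArea>'

# Only the "nsN:" prefix is matched, so it is replaced with nothing.
cleaned = PREFIX.sub('', sample)
print(cleaned)
```

In the asker's script, the same call could take the place of the existing `re.sub('<[^<]+>', ...)` line, with the file contents as the input string.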

3 answers

Answer 1
1 2018-05-18 20:19:14

Answer 2
0

Answer 3
0 2018-05-19 01:54:04
