Remove select characters from XML tags using regular expressions

Question

I am trying to remove only select characters from XML tags: the ns prefix, the digit that follows it, and the : after that. For example, <ns2:projectArea alias= should become <projectArea alias= and <ns9:name> should become <name>.

Basically, the digit will be random (anything from 1-9), and it will always be followed by a : that must also be deleted.

What I have so far is:

import argparse
import re

# Initiates argument
parser = argparse.ArgumentParser()

parser.add_argument("--input", "-i", help="Set the input xml to clean up")
parser.add_argument("--output", "-o", help="Set the output xml location")

args = parser.parse_args()
inputfile = args.input
outputfile = args.output
if args.input:
  print("inputfile location is %s" % args.input)
if args.output:
  print("outputfile location is %s" % args.output)
# End argument

text = re.sub('<[^<]+>', "", open(inputfile).read())
with open(outputfile, "w") as f:
    f.write(text)

This piece of the code is the issue: '<[^<]+>'. It deletes entire tags, so if I need to search the text later on, I basically have to search plain text rather than by tag.
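To make the problem concrete, here is a small sketch run on a made-up sample string (the `sample` value is purely illustrative and not taken from my real file). It shows that the pattern matches each tag in full, so only the text content survives.

```python
import re

# Hypothetical sample input, invented for illustration only.
sample = '<ns2:projectArea alias="demo"><ns9:name>Example</ns9:name></ns2:projectArea>'

# '<[^<]+>' matches a '<', then any run of characters that are not '<',
# ending at a '>'. That covers a whole tag, so replacing each match with ""
# removes the tags themselves, not just the ns prefix.
stripped = re.sub(r'<[^<]+>', "", sample)
print(stripped)  # prints: Example
```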

What can I replace '<[^<]+>' with so that it deletes ns + the following number (whatever number it may be) + the : that follows it?

Answer 1

The regular expression itself is probably the cause. Try this one instead:

   text = re.sub(r'<(/?)[a-zA-Z0-9]+:', r'<\1', open(inputfile).read())
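As a rough, self-contained sketch of this kind of substitution, the snippet below works on an in-memory string instead of a file. It assumes every prefix looks like ns followed by one or more digits, as in the question; the function name `strip_ns_prefix` and the sample string are hypothetical. Keep in mind that a leading ^ only anchors at the very start of the input unless re.MULTILINE is set, so the sketch leaves it out, and it captures an optional / so that closing tags are rewritten as well.

```python
import re

def strip_ns_prefix(xml_text):
    """Remove 'ns<digits>:' prefixes from opening and closing tag names."""
    # (/?) captures the optional slash of a closing tag; \1 puts it back.
    return re.sub(r'<(/?)ns[0-9]+:', r'<\1', xml_text)

# Hypothetical sample input, invented for illustration only.
sample = '<ns2:projectArea alias="demo"><ns9:name>Example</ns9:name></ns2:projectArea>'
print(strip_ns_prefix(sample))
# prints: <projectArea alias="demo"><name>Example</name></projectArea>
```

Because the pattern requires a < (optionally followed by /) directly before the prefix, text such as ns1: inside element content is left alone.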

Answer 2

This works:

https://regex101.com/r/yRhMI9/1

Readable version:

 <
 (?:
      (?:
           ( /? )                        # (1)
           \w+ [1-9] :
           ( \w+ \s* /? )                # (2)
      )
   |  (?:
           \w+ [1-9] :
           (                             # (3 start)
                \w+ \s+ 
                (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]? )+
                \s* /?
           )                             # (3 end)
      )
 )
 >
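One way to use the readable version from Python might look like the sketch below. The replacement string `<\1\2\3>` is an assumption on my part (the text here only shows the pattern), as are the names `NS_TAG` and `strip_prefixed_tags` and the sample input. It relies on re.VERBOSE to ignore the layout whitespace and comments, and on Python 3.5+ replacing unmatched groups with an empty string.

```python
import re

# The readable pattern, compiled with re.VERBOSE so that layout whitespace
# and '#' comments inside the pattern are ignored.
NS_TAG = re.compile(r"""
 <
 (?:
      (?:
           ( /? )                        # (1) optional slash of a closing tag
           \w+ [1-9] :
           ( \w+ \s* /? )                # (2) tag name, no attributes
      )
   |  (?:
           \w+ [1-9] :
           (                             # (3) tag name plus attributes
                \w+ \s+
                (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]? )+
                \s* /?
           )
      )
 )
 >
""", re.VERBOSE)

def strip_prefixed_tags(xml_text):
    # Rebuild each tag from whichever groups matched (assumed replacement).
    return NS_TAG.sub(r'<\1\2\3>', xml_text)

# Hypothetical sample input, invented for illustration only.
sample = '<ns2:projectArea alias="demo"><ns9:name>Example</ns9:name></ns2:projectArea>'
print(strip_prefixed_tags(sample))
# prints: <projectArea alias="demo"><name>Example</name></projectArea>
```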

Answer 3

Regex: (?:(?<=<)|(?<=<\/))(ns[0-9]+:)(?=[^>]*?>)
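A minimal sketch of applying this in Python, assuming the matched prefix is simply replaced with an empty string. The name `NS_PREFIX` and the sample string are hypothetical. The two lookbehinds keep the < or </ in place, and the lookahead only checks that a closing > follows, so nothing but the ns prefix itself is consumed.

```python
import re

# Matches 'ns<digits>:' only when it directly follows '<' or '</'
# and a closing '>' appears later in the same tag.
NS_PREFIX = re.compile(r'(?:(?<=<)|(?<=<\/))(ns[0-9]+:)(?=[^>]*?>)')

# Hypothetical sample input, invented for illustration only.
sample = '<ns2:projectArea alias="demo"><ns9:name>Example</ns9:name></ns2:projectArea>'

cleaned = NS_PREFIX.sub('', sample)
print(cleaned)
# prints: <projectArea alias="demo"><name>Example</name></projectArea>
```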



3 answers

Answer 1
1 2018-05-18 20:19:14

Answer 2
0

Answer 3
0 2018-05-19 01:54:04
