Remove select characters from XML tags using regex

I am trying to remove only select characters from XML tags: the ns prefix, the digit that follows it, and the : after that. For example, <ns2:projectArea alias= should look like <projectArea alias= and <ns9:name> should look like <name>.

Basically, the digit will be random (anything from 1-9) and it will always be followed by a : that must be deleted.

What I have so far is:

import argparse
import re

# Initiates argument
parser = argparse.ArgumentParser()

parser.add_argument("--input", "-i", help="Set the input xml to clean up")
parser.add_argument("--output", "-o", help="Set the output xml location")

args = parser.parse_args()
inputfile = args.input
outputfile = args.output
if args.input:
  print("inputfile location is %s" % args.input)
if args.output:
  print("outputfile location is %s" % args.output)
# End argument

text = re.sub('<[^<]+>', "", open(inputfile).read())
with open(outputfile, "w") as f:
    f.write(text)

This piece of the code is the issue: '<[^<]+>'. It deletes entire tags, so if I need to search the text later on, I basically have to search plain text rather than by tag.

What can I replace '<[^<]+>' with that will delete ns + the following number (whatever number it may be) + the : that follows it?

The regular expression is the cause: it matches entire tags. Try this one instead, which matches only the prefix directly after < or </ and leaves the rest of the tag in place:

   text = re.sub(r'<(/?)[a-zA-Z0-9]+:', r'<\1', open(inputfile).read())
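
One thing worth checking with a substitution like this is anchoring. Without re.MULTILINE, a leading ^ only matches at the very start of the input, so only a tag at position 0 would be rewritten. The sketch below compares an anchored and an unanchored pattern on a made-up sample string; the sample and the variable names are assumptions for illustration, not part of the original post.

```python
import re

# Hypothetical sample, invented for illustration.
sample = '<ns2:projectArea alias="demo"><ns9:name>x</ns9:name></ns2:projectArea>'

# Anchored with ^: only a tag at the very start of the string can match.
anchored = re.sub(r'^<[a-zA-Z0-9]+:', '<', sample)

# Unanchored, with an optional "/" captured so closing tags are handled too.
unanchored = re.sub(r'<(/?)[a-zA-Z0-9]+:', r'<\1', sample)

print(anchored)
print(unanchored)
```

The anchored call leaves the nested and closing tags untouched, while the unanchored one rewrites every prefixed tag in the sample.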

This works:

Find r"<(?:(?:(/?)\w+[1-9]:(\w+\s*/?))|(?:\w+[1-9]:(\w+\s+(?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+\s*/?)))>"
Replace <$1$2$3>

https://regex101.com/r/yRhMI9/1
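
For the Python script in the question, this find/replace pair could be applied with re.sub roughly as follows. This is a sketch, not the answerer's code: the sample string and the names PATTERN and strip_prefix are made up for illustration, and the $1$2$3 replacement is written in Python's \1\2\3 form. It assumes Python 3.5 or later, where groups that did not take part in a match expand to an empty string.

```python
import re

# The "Find" pattern, split over two raw-string literals for line length.
PATTERN = re.compile(
    r"<(?:(?:(/?)\w+[1-9]:(\w+\s*/?))"
    r"|(?:\w+[1-9]:(\w+\s+(?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+\s*/?)))>"
)


def strip_prefix(text):
    """Rebuild each matched tag from groups 1-3, dropping the prefix."""
    return PATTERN.sub(r"<\1\2\3>", text)


# Hypothetical sample, invented for illustration.
sample = '<ns2:projectArea alias="demo"><ns9:name>x</ns9:name></ns2:projectArea>'
print(strip_prefix(sample))
```

Groups 1 and 2 cover tags without attributes (including closing tags), and group 3 covers tags with attributes, so only one side of the alternation contributes to any given replacement.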

Readable version:

 <
 (?:
      (?:
           ( /? )                        # (1)
           \w+ [1-9] :
           ( \w+ \s* /? )                # (2)
      )
   |  (?:
           \w+ [1-9] :
           (                             # (3 start)
                \w+ \s+ 
                (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]? )+
                \s* /?
           )                             # (3 end)
      )
 )
 >
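
In Python, a spaced-out pattern like this can be compiled as written by passing re.VERBOSE, which ignores whitespace outside character classes and treats # as the start of a comment. A minimal sketch, assuming Python 3.5+ (so that unmatched groups expand to an empty string); the name READABLE and the sample string are invented for illustration.

```python
import re

# The readable pattern, compiled with re.VERBOSE so layout and comments are ignored.
READABLE = re.compile(r"""
 <
 (?:
      (?:
           ( /? )                        # (1)
           \w+ [1-9] :
           ( \w+ \s* /? )                # (2)
      )
   |  (?:
           \w+ [1-9] :
           (                             # (3 start)
                \w+ \s+
                (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]? )+
                \s* /?
           )                             # (3 end)
      )
 )
 >
""", re.VERBOSE)

# Hypothetical sample, invented for illustration.
sample = '<ns2:projectArea alias="demo"><ns9:name>x</ns9:name></ns2:projectArea>'
result = READABLE.sub(r"<\1\2\3>", sample)
print(result)
```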

Regex: (?:(?<=<)|(?<=<\/))(ns[0-9]+:)(?=[^>]*?>)
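
Because this pattern matches only the prefix itself (the lookbehinds and lookahead consume nothing), the replacement in Python can simply be an empty string. A minimal sketch; the names NS_PREFIX and strip_ns_prefixes and the sample string are assumptions made for illustration.

```python
import re

# Match "ns<digits>:" only when it directly follows "<" or "</"
# and a closing ">" appears later in the same tag.
NS_PREFIX = re.compile(r"(?:(?<=<)|(?<=<\/))(ns[0-9]+:)(?=[^>]*?>)")


def strip_ns_prefixes(text):
    """Delete the matched prefix; everything else in the tag is left alone."""
    return NS_PREFIX.sub("", text)


# Hypothetical sample, invented for illustration.
sample = '<ns2:projectArea alias="demo"><ns9:name>ns1: stays</ns9:name></ns2:projectArea>'
print(strip_ns_prefixes(sample))
```

Text content that happens to look like a prefix (the "ns1:" inside the element above) is not preceded by < or </, so it is left in place. In the question's script, the same call could take the place of the original re.sub line.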

