
Escape extra quotes in malformed xml

I have a malformed XML file that contains extra quotes inside an attribute value. I would like to remove them or replace them with &quot;. The malformed XML looks like:

<CLASS ATT2="PDX"R"088">

My expected result:

<CLASS ATT2="PDX R 088">
or
<CLASS ATT2="PDX&quot;R&quot;088">

I've tried iterating through all the lines and finding the first and last quote indexes of ATT, but it's quite dirty and produces too much code.

Does anyone have a simple solution for this?

This is not 100% foolproof, but might work with a little luck:

re.sub(r'(?<!=)"(?!>)', '&quot;', malformed_xml)

This replaces only the quotes that are neither preceded by = nor followed by >.
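To see what that substitution does, here is a small check on the line from the question. The second sample, a tag with two attributes, is an assumption added for illustration (it is not in the question). It shows one way the pattern can misfire: the closing quote of the first attribute is followed by a space rather than >, so it gets escaped too.

```python
import re

# Same pattern as above: a quote not preceded by '=' and not followed by '>'
PATTERN = r'(?<!=)"(?!>)'

def escape_inner_quotes(text):
    """Replace every quote matched by PATTERN with &quot;."""
    return re.sub(PATTERN, '&quot;', text)

# The malformed line from the question
malformed_xml = '<CLASS ATT2="PDX"R"088">'
fixed = escape_inner_quotes(malformed_xml)
print(fixed)   # <CLASS ATT2="PDX&quot;R&quot;088">

# Hypothetical well-formed tag with two attributes
two_attrs = '<CLASS ATT1="A" ATT2="B">'
broken = escape_inner_quotes(two_attrs)
print(broken)  # <CLASS ATT1="A&quot; ATT2="B">
```

So the one-liner seems safe only when each tag carries a single attribute and is not self-closing.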

If there could be whitespace after = (or before >), the lookbehind becomes variable-width, which the re module does not support, but the regex module (PyPI) can work with this:

regex.sub(r'(?<!=\s*)"(?!\s*>)', '&quot;', malformed_xml)
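If installing a third-party package isn't an option, one possible workaround with the standard re module is to match the legitimate delimiters as alternatives and leave them untouched in a replacement callback. This is only a sketch: the helper name escape_stray_quotes and the spaced-out sample tag are made up for illustration, and it still assumes one attribute per tag.

```python
import re

# Alternatives are tried left to right at each position:
#   group 1: opening delimiter, '=' + optional whitespace + quote
#   group 2: closing delimiter, quote + optional whitespace + '>'
#   otherwise: a bare quote, which is treated as stray
QUOTE = re.compile(r'(=\s*")|("\s*>)|"')

def escape_stray_quotes(text):
    """Escape quotes that are neither an opening nor a closing delimiter."""
    def repl(match):
        if match.group(1) or match.group(2):
            return match.group(0)   # legitimate delimiter: keep as-is
        return '&quot;'             # stray quote inside the value
    return QUOTE.sub(repl, text)

# Hypothetical variant of the question's line with extra whitespace
sample = '<CLASS ATT2 = "PDX"R"088" >'
result = escape_stray_quotes(sample)
print(result)  # <CLASS ATT2 = "PDX&quot;R&quot;088" >
```

Consuming the delimiter inside the match stands in for the lookaround, so no variable-width lookbehind is needed.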

Maybe not the best solution, but since you cannot parse the file with (e.g.) xml.etree because it is invalid, you can try playing with something like the code below.

It will:

  1. open the file
  2. read it line by line
  3. check whether each line contains a specific string (e.g. CLASS)
  4. if CLASS is found, find all the occurrences of double quotes ( " )
  5. if more than two double quotes are found, replace every quote except the first and the last with a space
  6. write the updated lines back to the file

WARNING: BACKUP YOUR ORIGINAL FILE AS THIS WILL MODIFY IT!!!

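import re

# Open in text mode for reading and writing (Python 3).
# The encoding is an assumption; adjust it to match your file.
with open(r'YOUR/FILE/HERE', 'r+', encoding='utf-8') as f:
    lines = f.readlines()
    for idx, row in enumerate(lines):
        if "CLASS" in row:
            # Positions of every double quote on this line
            quote_index = [m.start() for m in re.finditer('"', row)]
            if len(quote_index) > 2:
                # Keep the first and last quote, blank out the ones in between
                correct_row = list(row)
                for pos in quote_index[1:-1]:
                    correct_row[pos] = " "
                lines[idx] = "".join(correct_row)
    # Overwrite the file in place with the corrected lines
    f.seek(0)
    f.truncate()
    f.writelines(lines)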
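Whichever repair is used, a quick way to confirm that the result is well-formed is to feed it to a parser. The sketch below assumes the element is closed somewhere further down the file; the </CLASS> closing tag is added here only so that each snippet is a complete document, and is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The question's line before and after each kind of repair
malformed = '<CLASS ATT2="PDX"R"088"></CLASS>'
spaced = '<CLASS ATT2="PDX R 088"></CLASS>'
escaped = '<CLASS ATT2="PDX&quot;R&quot;088"></CLASS>'

print(is_well_formed(malformed))  # False
print(is_well_formed(spaced))     # True

# The escaped variant keeps the original quotes in the attribute value
value = ET.fromstring(escaped).get('ATT2')
print(value)                      # PDX"R"088
```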
