简体   繁体   中英

Escaping ampersand and other special characters in XML using Python

I have to parse a large amount of XML files and write it to a text file. However, some of the XML files contain special/illegal characters. I have tried using different methods such as Escaping strings for use in XML but I could not get it to work. I know 1 way to solve this is by editing the XML file itself, but there are thousands of files.

import os
from xml.etree import ElementTree
from xml.sax.saxutils import escape

fileNum = 0;
saveFile = open('NewYork_1.txt','w')

for path, dirs, files in os.walk("NewYork_1"):
   for f in files:
      fileName = os.path.join(path, f)
      with open(fileName,'r', encoding='utf-8') as myFile:
        # print(myFile.read())
        if "&" in myFile:
            myFile = myFile.replace("&", "&") #This does not work.

Also, some of the XML files also have unicode in them. There are emojis.

<Thread>
   <ThreadID></ThreadID>
   <Title></Title>
   <InitPost>
      <UserID></UserID>
      <Date></Date>
      <icontent></icontent>
   </InitPost>
   <Post>
      <UserID></UserID>
      <Date></Date>
      <rcontent></rcontent>
  </Post>
</Thread>

Try this:

import os
from xml.etree import ElementTree
from xml.sax.saxutils import escape

fileNum = 0;
saveFile = open('NewYork_1.txt','w')
saveFile.close()
for path, dirs, files in os.walk("NewYork_1"):
   for f in files:
       fileName = os.path.join(path, f)
       with open(fileName,'a', encoding='utf-8') as myFile:
           myFile=myFile.read()
           if "&" in myFile:
                myFile = myFile.replace("&", "&#38;")

I personally would generate a list of files to read and iterate through that list rather than use os.walk (if you're getting the list from a previous function or a separate script, you could always create a text file with each txt file on a separate line and iterate through lines rather than getting from a variable to save RAM), but to each his own.

As I said, though, I'd discard the whole idea of replacing special characters and use bs4 to open the files, search for which elements you're looking for, and grab from there.

import bs4
list_of_USER_IDs=[]
with open(fileName,'r', encoding='utf-8') as myFile:
    a=myFile.read()
    b=bs4.beautifulSoup(a)
    for elem in b.findAll('USERID'):
        list_of_USER_IDs.append(elem)

That returns the data between the USERID tags, but it'll work for whatever tag you want data from. No need to directly parse xml. It's really not much different than HTML and beautifulSoup is made for that, so why reinvent the wheel?

One more thing...

You can use BeautifulSoup to just find the parts of the xml file that you're extracting data from. Would stagger processing as-needed rather than loading the entire file into a variable and then iterating element by element.

For example:

If your xml file has tags such as:

<pos reviews>
<review_id>001<review_id>
<reviewer_name>John Stamos (From full house)</reviewer_name>
<review_text>This is a great item</review_text>

You could just beautifulsoup.findAll(whatever tag) and just grab that data. I don't know what you're doing specifically, though. Figured it was worth suggesting.

What you can do is read the file in as a string.

file_str = open("xml_file.xml", "r+")

Then use the fromString function of xml.etree.ElementTree

tree = xml.etree.ElementTree.fromString(file_str)

Then do stuff with it.

root = tree.getRoot()

This way you don't have to worry about writing code for every emoji, and invalid sign. If you want to write emojis and special utf-8 characters to xml, you have to add this line to the top.

 <?xml version="1.0" encoding="utf-8" ?>

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM