
How to better read and parse an xml file using Python and SAX?

Windows 11/Python 3.8.10 - Using Spyder Python IDE and PyCharm

Hey all, I'm newish to Python app dev and have a big project to parse XML files. I'm trying to write a Python program for it. Below is a very small sample of the XML file data structure I am working with.

     <PillCall XMLInstanceID="98089D9A-768A-4FA0-A7CD-DC5966EB5B06" PillCallID="49" VersionNumber="1.2">
     </PillCall>

These XML files will be huge. Eventually this will need to process multiple large files with a lot of data concurrently, 24/7. The plan is to parse the data and save it to a db, then, after modification, create a new modified XML file based on the current data in the db.

Here is my sample program, from the Spyder IDE. I have tried a bunch of other methods, but the SAX method has been the easiest for me personally to understand so far. I am sure there are better ways though.

import xml.sax

class XMLHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.CurrentData = ""
        self.pillcall = ""
        self.pillcallid = ""
        self.vernum = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "PillCall":
            print("*****PillCall*****")
            title = attributes["XMLInstanceID"]
            print("XMLInstanceID:=", title)  # How to add multiple values/strings here?
            # print(sorted()


# create an XMLReader
parser = xml.sax.make_parser()

# turn off namespaces
parser.setFeature(xml.sax.handler.feature_namespaces, 0)

# override the default ContentHandler
Handler = XMLHandler()
parser.setContentHandler(Handler)
parser.parse("xmltest10.xml")

My output is this:

*****PillCall*****
XMLInstanceID:= 98089D9A-768A-4FA0-A7CD-DC5966EB5B06

I have tried many different ways to read the whole string with ElementTree and BeautifulSoup but can't get it to work. I also get no output when running this program in PyCharm.

Here is some extra Python/SAX code that I have been messing with as well, but I haven't got it to work right either.

For now I just need to be able to clearly read the data and parse it to a new file, and also to loop through it and find all the data to output. Thanks for any and all help!!

    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentData != "/PillCall":
            print("End of PillCall:", self.pillcall)
        elif self.CurrentData == "PillCallID":
            print("PillCallID:=", self.pillcallid)
        elif self.CurrentData == "VersionNumber":
            print("VersionNumber:=", self.vernum)
        self.CurrentData = ""

    # Called when character data is read
    def characters(self, content):
        if self.CurrentData == "PillCall":
            self.pillcall = content
        elif self.CurrentData == "qty":
            self.pillcallid = content
        elif self.CurrentData == "company":
            self.vernum = content

Using BeautifulSoup's find_all may be what you're looking for...

Given:

from bs4 import BeautifulSoup

text = """
     <PillCall XMLInstanceID="98089D9A-768A-4FA0-A7CD-DC5966EB5B06" PillCallID="49" VersionNumber="1.2">
     </PillCall>
"""

# the 'xml' parser feature requires lxml to be installed
soup = BeautifulSoup(text, 'xml')

# each matching tag exposes its attributes as a dict
for result in soup.find_all('PillCall'):
    print(result.attrs)

Output:

{'PillCallID': '49',
 'VersionNumber': '1.2',
 'XMLInstanceID': '98089D9A-768A-4FA0-A7CD-DC5966EB5B06'}
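
Since the question asks about SAX specifically, here is a minimal sketch of the same idea with `xml.sax`, assuming the data you want lives in the attributes of `PillCall` as in the sample. In SAX, attributes are delivered to `startElement`, while `characters` only receives text between tags, so `PillCallID` and `VersionNumber` never show up there. The names `PillCallHandler` and `parse_pillcalls` are made up for this example, and the inline sample stands in for your real file.

```python
import io
import xml.sax
import xml.sax.handler


class PillCallHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects the attributes of every PillCall element."""

    def __init__(self):
        super().__init__()
        self.records = []

    def startElement(self, tag, attributes):
        # Attributes arrive here, not in characters(); copy them into a
        # plain dict so they outlive the parser callback.
        if tag == "PillCall":
            self.records.append(dict(attributes.items()))


def parse_pillcalls(source):
    """Parse a file path or binary file-like object; return a list of dicts."""
    handler = PillCallHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, False)
    parser.setContentHandler(handler)
    parser.parse(source)
    return handler.records


# Inline copy of the sample from the question (assumption: real input is a file).
sample = b"""<PillCall XMLInstanceID="98089D9A-768A-4FA0-A7CD-DC5966EB5B06" PillCallID="49" VersionNumber="1.2">
</PillCall>"""

records = parse_pillcalls(io.BytesIO(sample))
for record in records:
    for name, value in record.items():
        print(f"{name}:= {value}")
```

With a real file you would call `parse_pillcalls("xmltest10.xml")` instead. For very large inputs it is probably better to write each record to the db inside `startElement` rather than accumulate everything in `self.records`, since the list grows with the file.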
