How to better read and parse an XML file using Python and SAX?
Windows 11/Python 3.8.10 - Using Spyder Python IDE and PyCharm
Hey all, I'm newish to Python app development and have a big project to parse XML files. I'm trying to write a Python program for it. Below is a very small sample of the XML file data structure I am working with.
<PillCall XMLInstanceID="98089D9A-768A-4FA0-A7CD-DC5966EB5B06" PillCallID="49" VersionNumber="1.2">
</PillCall>
These XML files will be huge. Eventually this will need to process multiple large files with a lot of data concurrently, 24/7. The plan is to parse the data and save it to a database, then, after modification, create a new modified XML file based on the current data in the database.
Here is my sample program, from the Spyder Python IDE. I have tried a bunch of other methods, but the SAX method has been the easiest for me to understand so far. I am sure there are better ways, though.
import xml.sax

class XMLHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.CurrentData = ""
        self.pillcall = ""
        self.pillcallid = ""
        self.vernum = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "PillCall":
            print("*****PillCall*****")
            title = attributes["XMLInstanceID"]
            print("XMLInstanceID:=", title)  # How to add multiple values/strings here?
            # print(sorted()

# create an XMLReader
parser = xml.sax.make_parser()
# turn off namespaces
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
# override the default ContentHandler
Handler = XMLHandler()
parser.setContentHandler(Handler)
parser.parse("xmltest10.xml")
My output is this:
*****PillCall*****
XMLInstanceID:= 98089D9A-768A-4FA0-A7CD-DC5966EB5B06
I have tried many different ways to read the whole string with ElementTree and BeautifulSoup but can't get it to work. I also get no output when running this program in PyCharm.
Here is some extra Python/SAX code that I have been messing with as well, but I haven't got it to work right either.
I just need to be able to clearly read the data and parse it to a new file for now. I'd also like to know how to loop through it and find all the data to output. Thanks for any and all help!
    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentData != "/PillCall":
            print("End of PillCall:", self.pillcall)
        elif self.CurrentData == "PillCallID":
            print("PillCallID:=", self.pillcallid)
        elif self.CurrentData == "VersionNumber":
            print("VersionNumber:=", self.vernum)
        self.CurrentData = ""

    # Called when character data is read
    def characters(self, content):
        if self.CurrentData == "PillCall":
            self.pillcall = content
        elif self.CurrentData == "qty":
            self.pillcallid = content
        elif self.CurrentData == "company":
            self.vernum = content
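A note on the SAX attempt above: in the sample, XMLInstanceID, PillCallID and VersionNumber are attributes of the PillCall tag rather than child elements, so they arrive in startElement, not in characters. Below is a minimal sketch of reading all of them at once, assuming the real files follow the same layout. The class name PillCallHandler and the records list are illustrative names, not part of the original post, and the XML is parsed from an in-memory copy of the sample instead of xmltest10.xml.

```python
import xml.sax


class PillCallHandler(xml.sax.ContentHandler):
    """Collects the attributes of every <PillCall> element as a dict.

    PillCallHandler and records are hypothetical names used for illustration.
    """

    def __init__(self):
        super().__init__()
        self.records = []

    def startElement(self, tag, attributes):
        if tag == "PillCall":
            # attributes.items() yields (name, value) pairs for every attribute
            self.records.append(dict(attributes.items()))


# In-memory copy of the sample from the question (assumption: real files look alike)
SAMPLE = (
    b'<PillCall XMLInstanceID="98089D9A-768A-4FA0-A7CD-DC5966EB5B06" '
    b'PillCallID="49" VersionNumber="1.2">\n'
    b"</PillCall>"
)

handler = PillCallHandler()
xml.sax.parseString(SAMPLE, handler)

for record in handler.records:
    for name, value in record.items():
        print(f"{name}:= {value}")
```

For a file on disk, xml.sax.parse("xmltest10.xml", handler) should work the same way. Storing plain dicts keeps the handler independent of what happens next (printing, writing to a database, and so on).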
Using BeautifulSoup's find_all may be what you're looking for...
Given:
text = """
<PillCall XMLInstanceID="98089D9A-768A-4FA0-A7CD-DC5966EB5B06" PillCallID="49" VersionNumber="1.2">
</PillCall>
"""
from bs4 import BeautifulSoup
# the 'xml' parser requires the lxml package to be installed
soup = BeautifulSoup(text, 'xml')
for result in soup.find_all('PillCall'):
    print(result.attrs)
Output:
{'PillCallID': '49',
'VersionNumber': '1.2',
'XMLInstanceID': '98089D9A-768A-4FA0-A7CD-DC5966EB5B06'}
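One caveat: BeautifulSoup builds the whole document tree in memory, which may become a problem for the very large files mentioned in the question. A possible alternative is the standard library's xml.etree.ElementTree.iterparse, which reads the file incrementally. The sketch below assumes PillCall is the element of interest (in real files you would likely use whichever record element repeats), and iter_pillcalls is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

# In-memory copy of the sample from the question
SAMPLE = (
    b'<PillCall XMLInstanceID="98089D9A-768A-4FA0-A7CD-DC5966EB5B06" '
    b'PillCallID="49" VersionNumber="1.2">\n'
    b"</PillCall>"
)


def iter_pillcalls(source):
    """Yield the attribute dict of each <PillCall> element, one at a time.

    source may be a filename or a binary file-like object.
    iter_pillcalls is a hypothetical helper, not part of any library.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "PillCall":
            # Copy the attributes before clearing the element
            yield dict(elem.attrib)
            # Drop the element's children, text and attributes once handled
            elem.clear()


records = list(iter_pillcalls(io.BytesIO(SAMPLE)))
print(records)
```

Because it is a generator, each record can be written to the database as it is produced instead of collecting everything in a list first.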