简体   繁体   English

如何解析 python 中的大 (6 gb) xml 文件并将一些元素保存在列表或表格中?

[英]How to parse large (6 gb) xml file in python and save some of the elements in a list or table?

I'm trying to parse a xml file (6 gb) using python and save the elements in a list/table.我正在尝试使用 python 解析 xml 文件(6 gb)并将元素保存在列表/表格中。 I want to save into a table, ideally a dataframe the concatenation of the First name, middle name and surname in order to have a full name if it is available, and the active status.我想将名字、中间名和姓氏的串联保存到一个表中,理想情况下是 dataframe,以便有一个全名(如果可用)和活动状态。

<Person id="44855" action="chg" date="26-Aug-2022">
<Gender>Male</Gender>
<ActiveStatus>Active</ActiveStatus>
<Deceased>No</Deceased>
<NameDetails>
 <Name NameType="Primary Name">
   <NameValue>
      <FirstName>*****</FirstName>
      <Surname>******</Surname>
   </NameValue>
 </Name>
 <Name NameType="Low Quality AKA">
  <NameValue>
    <FirstName>****</FirstName>
  </NameValue>
 </Name>
 <Name NameType="Spelling Variation">
  <NameValue>
   <FirstName>*****</FirstName>
   <Surname>****</Surname>
  </NameValue>
 </Name>
</NameDetails>
</Person>

I've tried to parse it using xml.etree.ElementTree parse but i got memory error, i've tried pandas read_xml, but got memory error. I've tried to parse it using xml.etree.ElementTree parse but i got memory error, i've tried pandas read_xml, but got memory error. I don't have permission to install lxml, so i cant use lxml.etree.我没有安装 lxml 的权限,所以我不能使用 lxml.etree。 This is my first time uploading a question, and im not sure how to do it correctly, but feel free to ask any question, I really need help here.这是我第一次上传问题,我不确定如何正确地做,但请随时提出任何问题,我真的需要帮助。

The code i have so far is this (it was reused from a similar question on thi plataform)到目前为止我的代码是这个(它是从一个类似的问题中重复使用的)

import xml.sax
from xml.sax.handler import ContentHandler

class XmlHandler( xml.sax.ContentHandler ):
   def __init__(self):
      self.CurrentData = ""
      self.name = ""
      self.namevalue = ""
      self.firstname = ""
      self.middlename = ""
      self.surname = ""
      self.activestatus = ""
      self.data = {} #dict
      self.list = [] #list to store information
      self.list2 = []
      self.list3 = []
      self.list4 = []
      self.list5 = []
      self.list6 = []
      self.list7 = []
      self.list8 = []
      self.list9 = []
      self.list10 = []

   # Call when an element starts
   def startElement(self, tag, attributes):
      self.CurrentData = tag
      if tag == "Person":
         id = attributes["id"]
         self.list.append(id)
         self.data['id'] = self.list

   # Call when an elements ends
   def endElement(self, tag):    
      if self.CurrentData == "Name":
         name = self.name
         self.list3.append(name)
         self.data['Name'] = self.list3
      elif self.CurrentData == "NameValue":
         namevalue = self.namevalue
         self.list4.append(namevalue)
         self.data['NameValue'] = self.list4
      elif self.CurrentData == "FirstName":
         firstname = self.firstname
         self.list5.append(firstname)
         self.data['FirstName'] = self.list5
      elif self.CurrentData == "MiddleName":
         middlename = self.middlename
         self.list6.append(middlename)
         self.data['MiddleName'] = self.list6
      elif self.CurrentData == "Surname":
         surname = self.surname
         self.list7.append(surname)
         self.data['Surname'] = self.list7
         self.list9.append(firstname+' '+middlename+' '+ surname)
         self.data['Full name'] = self.list9
      elif self.CurrentData == "ActiveStatus":
         activestatus = self.activestatus
         self.list8.append(activestatus)
         self.data['ActiveStatus'] = self.list8
      self.CurrentData = ""

   # Call when a character is read
   def characters(self, content):
      if self.CurrentData == "Name":
         self.name = content
      elif self.CurrentData == "ActiveStatus":
         self.activestatus = content
      elif self.CurrentData == "NameValue":
         self.namevalue = content
      elif self.CurrentData == "FirstName":
         self.firstname = content
      elif self.CurrentData == "MiddleName":
         self.middlename = content
      elif self.CurrentData == "Surname":
         self.surname = content
      
if __name__ == '__main__':
  parser = xml.sax.make_parser()
  handler = XmlHandler()
  parser.setContentHandler(handler)
  parser.parse(source,encoding = 'utf8'))

I think at this size it is the easiest to write a custom XML-processor, xml.entree seems really inefficient for larger files.我认为在这种大小下,编写自定义 XML 处理器是最容易的,xml.entre 对于较大的文件似乎效率很低。
Here is an article that shows how this could work. 是一篇文章,展示了它是如何工作的。

Try following power shell script尝试以下电源 shell 脚本

using assembly System.Xml
using assembly System.Xml.Linq 

$Filename = "c:\temp\test.xml"

$reader = [System.Xml.XmlReader]::Create($Filename)
while($reader.EOF -eq $False)
{
    if($reader.Name -ne "Person")
    {
       $reader.ReadToFollowing("Person")
    }
    if($reader.EOF -eq $False)
    {
       $xPerson = [System.Xml.Linq.XElement]::ReadFrom($reader)
       $names = $xPerson.Descendants("Name")
       foreach($name in $Names)
       {
          $nameType = $name.Attribute("NameType").Value
          $firstName = $name.Descendants("FirstName").Value
          $surName = $name.Descendants("Surname").Value
          Write-Host "type = " $nameType, " first = " $firstName, " sur = "$surName
       }
    }  
}

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM