Parse XML with Python lxml

I am trying to parse an XML file using the Python library lxml, and would like the resulting output to be in a dataframe. I am relatively new to Python and parsing, so please bear with me as I outline the problem. The original XML that I am trying to parse is available here.

I am interested in obtaining some relevant tags found in "invstOrSec". Below is a snapshot of one instance of "invstOrSec" where the text accompanying the tag "curCd" is USD.

<?xml version="1.0" encoding="UTF-8"?>
    <invstOrSec>
        <name>NIPPON LIFE INSURANCE</name>
        <lei>549300Y0HHMFW3EVWY08</lei>
        <curCd>USD</curCd>
    </invstOrSec>

This is relatively straightforward, and my current approach involves first defining the relevant tags in a dictionary and then coercing them into a dataframe in a loop.

    import os

    import pandas as pd
    from lxml import etree

    # Declare directory
    os.chdir('C:/Users/A1610222/Desktop/Form NPORT/pkg/sec-edgar-filings/0001548717/NPORT-P/0001752724-20-040624')

    # Set root
    xmlfile = "filing-details.xml"
    tree = etree.parse(xmlfile)
    root = tree.getroot()

    # Remove namespace prefixes (skip comments and processing instructions,
    # whose tag is not a string)
    for elem in root.iter():
        if isinstance(elem.tag, str):
            elem.tag = etree.QName(elem).localname

    # Remove unused namespace declarations
    etree.cleanup_namespaces(root)

    # Set path
    invstOrSec = root.xpath('//invstOrSec')

    # Define tags to extract
    vars = {'invstOrSec': {'name', 'lei', 'curCd'}}

    # Extract holdings data: one dict per <invstOrSec>, one key per wanted child tag
    rows = []
    for one in invstOrSec:
        row = {}
        for two in one:
            if two.tag in vars['invstOrSec']:
                row[two.tag] = two.text
        rows.append(row)

    # DataFrame.append was removed in pandas 2.0, so build the frame in one step
    sec_info = pd.DataFrame(rows)

Here are the top three rows of sec_info:

name                      lei                   curCd
NIPPON LIFE INSURANCE     549300Y0HHMFW3EVWY08  USD
Lloyds Banking Group PLC  549300PPXHEU2JF0AM85  USD
Enbridge Inc              98TPTUM4IVMFCZBCUR27  USD

However, the XML follows a slightly different structure when the currency is not USD. See the example below.

<?xml version="1.0" encoding="UTF-8"?>
    <invstOrSec>
        <name>ACHMEA BV</name>
        <lei>7245007QUMI1FHIQV531</lei>
        <currencyConditional curCd="EUR" exchangeRt="0.89150400"/>
    </invstOrSec>

This time curCd is replaced with a different tag, currencyConditional, which carries the currency code in attributes rather than in text. I am having a hard time accounting for these cases while keeping my code as general as possible. I hope I have managed to illustrate the problem. Again, please excuse me if this is too elementary. Any help would be much appreciated.

This is one case where you shouldn't try to reinvent the wheel; use tools developed by others...

import pandas as pd
import pandas_read_xml as pdx

url = 'https://www.sec.gov/Archives/edgar/data/1548717/000175272420040624/primary_doc.xml'

df = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])

# Non-USD rows hold a dict of attributes in 'currencyConditional', while USD rows hold NaN (a float),
# so one more contortion is needed to pull out the currency code:
df['currencyConditional'] = df['currencyConditional'].apply(lambda x: x.get('@curCd') if not isinstance(x, float) else "NA")
df[['name', 'lei', 'curCd', 'currencyConditional']]

Output (partial, obviously) - note the last row:

168     BNP PARIBAS     R0MUWSFPU8MPRO8K5P83    USD     NA
169     Societe Generale    O2RNE8IBXP4R0TD8PU41    USD     NA
170     BARCLAYS PLC    213800LBQA1Y9L22JB70    NaN     GBP
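
If you would rather stay with lxml, as in the question, something along these lines might also work. It is only a sketch: it runs on an inline sample assembled from the two snippets in the question (the wrapping invstOrSecs root is there just to make the sample well-formed), it assumes namespaces are absent or have already been stripped as in the question's code, and the helper name extract_holdings and the extra exchangeRt column are my own additions for illustration. The idea is to read curCd as element text first and fall back to the attributes of currencyConditional when that element is missing.

```python
import pandas as pd
from lxml import etree

# Inline sample built from the two snippets in the question; the wrapping
# <invstOrSecs> root is included only to make the sample well-formed.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<invstOrSecs>
    <invstOrSec>
        <name>NIPPON LIFE INSURANCE</name>
        <lei>549300Y0HHMFW3EVWY08</lei>
        <curCd>USD</curCd>
    </invstOrSec>
    <invstOrSec>
        <name>ACHMEA BV</name>
        <lei>7245007QUMI1FHIQV531</lei>
        <currencyConditional curCd="EUR" exchangeRt="0.89150400"/>
    </invstOrSec>
</invstOrSecs>"""


def extract_holdings(root):
    """Return one row per invstOrSec, with a single curCd column.

    Hypothetical helper: assumes tag names carry no namespace prefix.
    """
    rows = []
    for sec in root.iter("invstOrSec"):
        row = {"name": sec.findtext("name"), "lei": sec.findtext("lei")}
        # USD holdings carry the currency code as element text ...
        cur = sec.findtext("curCd")
        exchange_rt = None
        if cur is None:
            # ... other currencies carry it as attributes of currencyConditional.
            cond = sec.find("currencyConditional")
            if cond is not None:
                cur = cond.get("curCd")
                exchange_rt = cond.get("exchangeRt")
        row["curCd"] = cur
        row["exchangeRt"] = exchange_rt
        rows.append(row)
    return pd.DataFrame(rows, columns=["name", "lei", "curCd", "exchangeRt"])


root = etree.fromstring(SAMPLE)
sec_info = extract_holdings(root)
print(sec_info)
```

This keeps the currency in one column instead of two, so no post-processing of a separate currencyConditional column is needed. Attribute values come back as strings, so exchangeRt would still have to be converted to a number if you want to compute with it.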
