
Parse XML to a pandas DataFrame in Python

I am trying to read an XML file and convert it to a pandas DataFrame, but it returns empty data.

Here is a sample of the XML structure:

<Instance ID="1">
<MetaInfo StudentID ="DTSU040" TaskID="LP03_PR09.bLK.sh"  DataSource="DeepTutorSummer2014"/>
<ProblemDescription>A car windshield collides with a mosquito, squashing it.</ProblemDescription>
<Question>How does this work tion?</Question>
<Answer>tthis is my best  </Answer>
<Annotation Label="correct(0)|correct_but_incomplete(1)|contradictory(0)|incorrect(0)">
<AdditionalAnnotation ContextRequired="0" ExtraInfoInAnswer="0"/>
<Comments Watch="1"> The student forgot to tell the opposite force. Opposite means opposite direction, which is important here. However, one can argue that the opposite is implied. See the reference answers.</Comments>
</Annotation>
<ReferenceAnswers>
1:  Since the windshield exerts a force on the mosquito, which we can call action, the mosquito exerts an equal and opposite force on the windshield, called the reaction.

</ReferenceAnswers>
</Instance>

I have tried this code, but it does not work for me. It returns an empty DataFrame.

import pandas as pd 
import xml.etree.ElementTree as et 

xtree = et.parse("grade_data.xml")
xroot = xtree.getroot() 

df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", 'Question', 'Answer',
           'ContextRequired', 'ExtraInfoInAnswer', 'Comments', 'Watch', 'ReferenceAnswers']
rows = []


for node in xroot: 
    s_name = node.attrib.get("ID")
    s_student = node.find("StudentID") 
    s_task = node.find("TaskID") 
    s_source = node.find("DataSource") 
    s_desc = node.find("ProblemDescription") 
    s_question = node.find("Question") 
    s_ans = node.find("Answer") 
    s_label = node.find("Label") 
    s_contextrequired = node.find("ContextRequired") 
    s_extraInfoinAnswer = node.find("ExtraInfoInAnswer")
    s_comments = node.find("Comments") 
    s_watch = node.find("Watch") 
    s_referenceAnswers = node.find("ReferenceAnswers") 


    rows.append({"ID": s_name,"StudentID":s_student, "TaskID": s_task, 
                 "DataSource": s_source, "ProblemDescription": s_desc , 
                 "Question": s_question , "Answer": s_ans ,"Label": s_label,
                 "s_contextrequired": s_contextrequired , "ExtraInfoInAnswer": s_extraInfoinAnswer ,
                 "Comments": s_comments ,  "Watch": s_watch, "ReferenceAnswers": s_referenceAnswers, 

                })

out_df = pd.DataFrame(rows, columns = df_cols)

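The problem with your solution is that the element data extraction is not done correctly. The XML in your question is nested several layers deep, which is why the data has to be read and extracted recursively. The following solution should give you what you need. I would still encourage you to read this article and the Python documentation for more clarity.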
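To make the nesting concrete, here is a minimal sketch of why the lookups in the question come back empty. It assumes the instances sit under an `<Instances>` root and uses a trimmed copy of the sample; the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question, wrapped in an assumed
# <Instances> root so that root[0] is one <Instance> element.
xml_string = """
<Instances>
  <Instance ID="1">
    <MetaInfo StudentID="DTSU040" TaskID="LP03_PR09.bLK.sh" DataSource="DeepTutorSummer2014"/>
    <Question>How does this work tion?</Question>
    <Annotation Label="correct(0)|incorrect(0)">
      <Comments Watch="1">See the reference answers.</Comments>
    </Annotation>
  </Instance>
</Instances>
"""

root = ET.fromstring(xml_string)
instance = root[0]

# StudentID is an attribute of <MetaInfo>, not a child tag, so find() misses it.
missing = instance.find("StudentID")
student_id = instance.find("MetaInfo").get("StudentID")

# find() returns an Element object; the content itself is in .text.
question_element = instance.find("Question")
question_text = question_element.text

# <Comments> is nested inside <Annotation>, so it is not a direct child of <Instance>.
direct_comments = instance.find("Comments")
nested_comments = instance.find("Annotation/Comments")
watch = nested_comments.get("Watch")

print(missing, student_id, question_text, direct_comments, watch)
```

So three different things are mixed together in the question's loop: attributes (read with `.get()`), element text (read with `.text`), and nested elements (reached with a path or by recursion).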

Method 1

import pandas as pd
import xml.etree.ElementTree as ET

def xml2df(xml_source, df_cols, source_is_file = False, show_progress=True): 
    """Parse the input XML source and store the result in a pandas 
    DataFrame with the given columns. 

    For xml_source = xml_file, Set: source_is_file = True
    For xml_source = xml_string, Set: source_is_file = False

    <element attribute_key1=attribute_value1, attribute_key2=attribute_value2>
        <child1>Child 1 Text</child1>
        <child2>Child 2 Text</child2>
        <child3>Child 3 Text</child3>
    </element>
    Note that for an xml structure as shown above, the attributes of the
    element tag can be accessed with element.items() (or element.attrib) and
    its child elements with list(element). Any text associated with the
    <element> tag can be accessed as element.text, and the name of the tag
    itself can be accessed with element.tag.
    """
    if source_is_file:
        xtree = ET.parse(xml_source) # xml_source = xml_file
        xroot = xtree.getroot()
    else:
        xroot = ET.fromstring(xml_source) # xml_source = xml_string
    consolidator_dict = dict()
    default_instance_dict = {label: None for label in df_cols}

    def get_children_info(children, instance_dict):
        # We avoid using element.getchildren() as it is deprecated
        # (and removed in Python 3.9).
        # Instead use list(element) to get a list of child elements.
        for child in children:
            #print(child)
            #print(child.tag)
            #print(child.items())
            #print(child.getchildren()) # deprecated method
            #print(list(child))
            if len(list(child))>0:
                instance_dict = get_children_info(list(child), 
                                                  instance_dict)

            if len(list(child.keys()))>0:
                items = child.items()
                instance_dict.update({key: value for (key, value) in items})             

            #print(child.keys())
            instance_dict.update({child.tag: child.text})
        return instance_dict

    # Loop over all instances
    for instance in list(xroot):
        instance_dict = default_instance_dict.copy()           
        ikey, ivalue = list(instance.items())[0] # The first attribute is "ID"
        instance_dict.update({ikey: ivalue}) 
        if show_progress:
            print('{}: {}={}'.format(instance.tag, ikey, ivalue))
        # Loop inside every instance
        instance_dict = get_children_info(list(instance), 
                                          instance_dict)   

        #consolidator_dict.update({ivalue: instance_dict.copy()}) 
        consolidator_dict[ivalue] = instance_dict.copy()       
    df = pd.DataFrame(consolidator_dict).T 
    df = df[df_cols]

    return df

Run the following to generate the desired output.

xml_source = r'grade_data.xml'
df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", "Question", "Answer",
           "ContextRequired", "ExtraInfoInAnswer", "Comments", "Watch", 'ReferenceAnswers']

df = xml2df(xml_source, df_cols, source_is_file = True)
df
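If the full recursion feels heavy, the same flattening idea can be sketched more compactly with `Element.iter()`, which walks an element and all its descendants. This is not the function above, just a hedged alternative; `flatten_instance` is a hypothetical helper name, and the sample is a trimmed copy of the XML from the question under an assumed `<Instances>` root.

```python
import xml.etree.ElementTree as ET

def flatten_instance(instance):
    """Collect attributes and non-empty text of an element and all its descendants."""
    row = {}
    # iter() yields the element itself first, then its descendants in document order.
    for element in instance.iter():
        row.update(element.attrib)
        text = (element.text or "").strip()
        if text:
            # Keyed by tag name; a repeated tag would overwrite the earlier value.
            row[element.tag] = text
    return row

xml_string = """
<Instances>
  <Instance ID="1">
    <MetaInfo StudentID="DTSU040" TaskID="LP03_PR09.bLK.sh" DataSource="DeepTutorSummer2014"/>
    <Question>How does this work tion?</Question>
    <Annotation Label="correct(0)|incorrect(0)">
      <AdditionalAnnotation ContextRequired="0" ExtraInfoInAnswer="0"/>
      <Comments Watch="1">See the reference answers.</Comments>
    </Annotation>
  </Instance>
</Instances>
"""

root = ET.fromstring(xml_string)
rows = [flatten_instance(instance) for instance in root]
print(rows[0])
```

The resulting list of dicts can be passed to `pd.DataFrame(rows, columns=df_cols)`; any column in `df_cols` that is absent from a row comes out as NaN. The trade-off is that attribute and tag names share one namespace, so this only works when they do not collide.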

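Method 2

Given that you have xml_string, you can convert xml >> dict >> dataframe. Run the following to get the desired output.

Note: you need to install xmltodict to use Method 2. This approach was inspired by @martin-blech's answer to How to convert XML to JSON in Python? [duplicate]. Thanks to @martin-blech for creating it.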

pip install -U xmltodict

Solution

def read_recursively(x, instance_dict):  
    #print(x)
    txt = ''
    for key in x.keys():
        k = key.replace("@","")
        if k in df_cols: 
            if isinstance(x.get(key), dict):
                instance_dict, txt = read_recursively(x.get(key), instance_dict)
            #else:                
            instance_dict.update({k: x.get(key)})
            #print('{}: {}'.format(k, x.get(key)))
        else:
            #print('else: {}: {}'.format(k, x.get(key)))
            # dig deeper if value is another dict
            if isinstance(x.get(key), dict):
                instance_dict, txt = read_recursively(x.get(key), instance_dict)                
            # add simple text associated with element
            if k=='#text':
                txt = x.get(key)
        # update text to corresponding parent element    
        if (k!='#text') and (txt!=''):
            instance_dict.update({k: txt})
    return (instance_dict, txt)

You will need the function read_recursively() given above. Now run the following.

import xmltodict, json

o = xmltodict.parse(xml_string) # INPUT: XML_STRING
#print(json.dumps(o)) # uncomment to see xml to json converted string

consolidated_dict = dict()
oi = o['Instances']['Instance']

for x in oi:
    instance_dict = dict()
    instance_dict, _ = read_recursively(x, instance_dict)
    consolidated_dict.update({x.get("@ID"): instance_dict.copy()})
df = pd.DataFrame(consolidated_dict).T
df = df[df_cols]
df

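A few issues:

  • Calling .find on the loop variable, node, expects a child node to exist: current_node.find('child_of_current_node'). However, since all the nodes are children of the root, they do not maintain their own children, so no loop is required;
  • Not checking for the NoneType that find() can return for missing nodes, which prevents retrieving .tag, .text, or other attributes;
  • Not retrieving node content with .text; otherwise the <Element... object is returned;

Consider this adjustment using the ternary conditional expression, a if condition else b, to ensure each variable has a value regardless: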

rows = []

s_name = xroot.attrib.get("ID")
s_student = xroot.find("StudentID").text if xroot.find("StudentID") is not None else None
s_task = xroot.find("TaskID").text if xroot.find("TaskID") is not None else None      
s_source = xroot.find("DataSource").text if xroot.find("DataSource") is not None else None
s_desc = xroot.find("ProblemDescription").text if xroot.find("ProblemDescription") is not None else None
s_question = xroot.find("Question").text if xroot.find("Question") is not None else None    
s_ans = xroot.find("Answer").text if xroot.find("Answer") is not None else None
s_label = xroot.find("Label").text if xroot.find("Label") is not None else None
s_contextrequired = xroot.find("ContextRequired").text if xroot.find("ContextRequired") is not None else None
s_extraInfoinAnswer = xroot.find("ExtraInfoInAnswer").text if xroot.find("ExtraInfoInAnswer") is not None else None
s_comments = xroot.find("Comments").text if xroot.find("Comments") is not None else None
s_watch = xroot.find("Watch").text if xroot.find("Watch") is not None else None
s_referenceAnswers = xroot.find("ReferenceAnswers").text if xroot.find("ReferenceAnswers") is not None else None

# Keys must match the names in df_cols, otherwise the column comes out empty
rows.append({"ID": s_name,"StudentID":s_student, "TaskID": s_task, 
             "DataSource": s_source, "ProblemDescription": s_desc , 
             "Question": s_question , "Answer": s_ans ,"Label": s_label,
             "ContextRequired": s_contextrequired , "ExtraInfoInAnswer": s_extraInfoinAnswer ,
             "Comments": s_comments ,  "Watch": s_watch, "ReferenceAnswers": s_referenceAnswers     
            })

out_df = pd.DataFrame(rows, columns = df_cols)

Alternatively, run a more dynamic version that assigns to an inner dictionary using the iterator variable:

inner = {}
for node in xroot: 
    # Collect every child of the root into one dict so the result is a single row
    inner[node.tag] = node.text

rows = [inner]

out_df = pd.DataFrame(rows, columns = df_cols)

Or as a list/dict comprehension:

rows = [{node.tag: node.text for node in xroot}]
out_df = pd.DataFrame(rows, columns = df_cols)
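For the nested sample in the question, where several of the wanted values are attributes rather than element text, a hedged variant of the same None-safe idea could use explicit paths. `attr_of` and `instance_to_row` are hypothetical helper names, the `<Instances>` root is an assumption, and the sample is trimmed (it deliberately omits ProblemDescription to show the missing-node case).

```python
import xml.etree.ElementTree as ET

def attr_of(parent, path, name):
    """Return attribute `name` of the element at `path`, or None if that element is absent."""
    element = parent.find(path)
    return element.get(name) if element is not None else None

def instance_to_row(instance):
    """Build one row dict for an <Instance>; findtext() returns None for a missing element."""
    return {
        "ID": instance.get("ID"),
        "TaskID": attr_of(instance, "MetaInfo", "TaskID"),
        "DataSource": attr_of(instance, "MetaInfo", "DataSource"),
        "ProblemDescription": instance.findtext("ProblemDescription"),
        "Question": instance.findtext("Question"),
        "Answer": instance.findtext("Answer"),
        "ContextRequired": attr_of(instance, "Annotation/AdditionalAnnotation", "ContextRequired"),
        "ExtraInfoInAnswer": attr_of(instance, "Annotation/AdditionalAnnotation", "ExtraInfoInAnswer"),
        "Comments": instance.findtext("Annotation/Comments"),
        "Watch": attr_of(instance, "Annotation/Comments", "Watch"),
        "ReferenceAnswers": instance.findtext("ReferenceAnswers"),
    }

xml_string = """
<Instances>
  <Instance ID="1">
    <MetaInfo StudentID="DTSU040" TaskID="LP03_PR09.bLK.sh" DataSource="DeepTutorSummer2014"/>
    <Question>How does this work tion?</Question>
    <Answer>tthis is my best</Answer>
    <Annotation Label="correct(0)|incorrect(0)">
      <AdditionalAnnotation ContextRequired="0" ExtraInfoInAnswer="0"/>
      <Comments Watch="1">See the reference answers.</Comments>
    </Annotation>
    <ReferenceAnswers>1: The mosquito exerts an equal and opposite force.</ReferenceAnswers>
  </Instance>
</Instances>
"""

root = ET.fromstring(xml_string)
rows = [instance_to_row(instance) for instance in root]
print(rows[0])
```

Passing `rows` to `pd.DataFrame(rows, columns=df_cols)` then gives one DataFrame row per instance. The explicit paths are more verbose than a recursive walk, but they make it obvious where each column comes from and they fail softly (None) when a node is missing.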

