
How to extract node information conditional to the information of a sibling node using python?

I have a dictionary holding the personId values of interest:

agents = {'id': ['20','32','12']}

Then I have an XML file with household characteristics:

<households>
    <household id="980921">
        <members>
            <personId refId="5"/>
            <personId refId="15"/>
            <personId refId="20"/>
        </members>
        <income currency="CHF" period="month">
                8000.0
        </income>
        <attributes>
            <attribute name="numberOfCars" class="java.lang.String" >2</attribute>
        </attributes>

    </household>
    <household id="980976">
        <members>
            <personId refId="2891"/>
            <personId refId="100"/>
            <personId refId="2044"/>
        </members>
        <income currency="CHF" period="month">
                8000.0
        </income>
        <attributes>
            <attribute name="numberOfCars" class="java.lang.String" >1</attribute>
        </attributes>

    </household>
    <household id="980983">
        <members>
            <personId refId="11110"/>
            <personId refId="32"/>
            <personId refId="34"/>
        </members>
        <income currency="CHF" period="month">
                10000.0
        </income>
        <attributes>
            <attribute name="numberOfCars" class="java.lang.String" >0</attribute>
        </attributes>

    </household>
</households>

I want a data frame showing the income of every household that contains at least one member from the list of agents of interest. Something like this (as a bonus, an additional column giving the number of members in each such household would be nice):

personId    income
20          8000.0
32          10000.0

My approach did not get very far. I am having difficulty filtering on the members and then accessing information from a "sibling" node. My output is an empty data frame.

import xml.etree.ElementTree as ET
import pandas as pd

with open(xml) as fd:
    root = ET.parse(fd).getroot()

xpath_fmt = 'household/members/personId[@refId="{}"]/income'
rows = []
for pid in agents['id']:
    xpath = xpath_fmt.format(pid)
    r = root.findall(xpath)
    for res in r:
        rows.append([pid, res.text])
d = pd.DataFrame(rows, columns=['personId', 'income']) 

Thanks a lot for your help!

As suggested in the comments, here is a solution using BeautifulSoup (xml_txt is the XML text from the question):

import pandas as pd
from bs4 import BeautifulSoup

agents = {'id': ['20','32','12']}

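# The 'xml' parser requires the lxml package to be installed.
soup = BeautifulSoup(xml_txt, 'xml')  # xml_txt is your XML text from the question

# Build one CSS selector that matches any personId element whose refId is in the list.
css_selector = ','.join('household > members > personId[refId="{}"]'.format(i) for i in agents['id'])

data = {'personId': [], 'income': []}
for person in soup.select(css_selector):
    data['personId'].append(person['refId'])
    # income is a child of the enclosing household, so go up to it first.
    data['income'].append(person.find_parent('household').find('income').get_text(strip=True))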

df = pd.DataFrame(data)
print(df)

Prints:

  personId   income
0       20   8000.0
1       32  10000.0
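
If you would rather stay with the standard library, here is a minimal sketch using xml.etree.ElementTree. The XPath in the question returns nothing because income is a child of household, not of personId, so the path personId[...]/income never matches. The sketch instead loops over each household, collects its members, and reads income from the same household element. It also adds the member count asked for as a bonus. Note the assumptions: xml_txt below is a trimmed copy of the XML from the question (attributes block dropped, opening tag closed), the column name memberCount is my own choice, and income is converted to float rather than kept as a string.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Trimmed copy of the XML from the question (assumption: attributes omitted).
xml_txt = """<households>
    <household id="980921">
        <members>
            <personId refId="5"/>
            <personId refId="15"/>
            <personId refId="20"/>
        </members>
        <income currency="CHF" period="month">
                8000.0
        </income>
    </household>
    <household id="980976">
        <members>
            <personId refId="2891"/>
            <personId refId="100"/>
            <personId refId="2044"/>
        </members>
        <income currency="CHF" period="month">
                8000.0
        </income>
    </household>
    <household id="980983">
        <members>
            <personId refId="11110"/>
            <personId refId="32"/>
            <personId refId="34"/>
        </members>
        <income currency="CHF" period="month">
                10000.0
        </income>
    </household>
</households>"""

agents = {'id': ['20', '32', '12']}
wanted = set(agents['id'])  # set for fast membership tests

root = ET.fromstring(xml_txt)

rows = []
for household in root.findall('household'):
    # All member ids of this household.
    members = [p.get('refId') for p in household.findall('members/personId')]
    # income is a direct child of household; strip the surrounding whitespace.
    income = float(household.findtext('income').strip())
    for pid in members:
        if pid in wanted:
            rows.append({'personId': pid,
                         'income': income,
                         'memberCount': len(members)})  # hypothetical column name

df = pd.DataFrame(rows, columns=['personId', 'income', 'memberCount'])
print(df)
```

Agent 12 does not appear in any household, so it produces no row. If you need a row for every agent, you could merge the result against a data frame built from agents afterwards.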
