lxml iterparse, child with same tag name

I want to parse XML on the fly from a large file (1.5 GB) that looks like this:

<product product_id="x" name="x" sku_number="x">
    <category>
        <primary>x</primary>
        <secondary>y</secondary>
    </category>
    <URL>
        <product>URL__I_WANT_TO_PULLOUT</product>
        <productImage>x</productImage>
    </URL>
    <description>
        <short>x</short>
        <long>x</long>
    </description>
</product>

I'm using lxml.etree.iterparse like this:

for event, elem in ET.iterparse(f, events=('end',), tag='product'):
    save_product(elem)

I get all the required values from the XML nodes. The only node I can't pull out is URL > product (it's just empty). I think this is caused by the identical tag name. Is there any way to parse XML on the fly besides iterparse?

If I run etree.iterparse on your sample, it finds the 'product' tag twice: there is one outer <product> and one inner <product>. The outer element has child elements, and its own text is only whitespace. So you need to skip the outer 'product' elements and work only with those that have no child elements, for example:

for event, elem in etree.iterparse(f, events=('end',), tag='product'):
    if not len(elem):
        save_product(elem)
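To see this behaviour in isolation, here is a minimal sketch that runs the same loop over the sample from the question, held in memory. The names SAMPLE and collect_leaf_products are made up for this illustration, and BytesIO stands in for the real file.

```python
from io import BytesIO
from lxml import etree

# The sample document from the question, kept in memory for the demo.
SAMPLE = b"""<product product_id="x" name="x" sku_number="x">
    <category>
        <primary>x</primary>
        <secondary>y</secondary>
    </category>
    <URL>
        <product>URL__I_WANT_TO_PULLOUT</product>
        <productImage>x</productImage>
    </URL>
    <description>
        <short>x</short>
        <long>x</long>
    </description>
</product>"""


def collect_leaf_products(source):
    """Return the text of every <product> element that has no children."""
    texts = []
    for event, elem in etree.iterparse(source, events=('end',), tag='product'):
        if not len(elem):  # len(elem) counts child elements; 0 means inner tag
            texts.append(elem.text)
    return texts


leaf_texts = collect_leaf_products(BytesIO(SAMPLE))
print(leaf_texts)  # ['URL__I_WANT_TO_PULLOUT']
```

The 'end' event for the inner <product> fires before the one for the outer element, because the inner element closes first.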

If you need to handle every field of the top-level product, you can skip the inner product elements in the main loop and then read the child elements of the outer one by path, for example:

def save_product(elem):
    # Relative paths are evaluated against the outer <product> element.
    cat_prim = elem.xpath('category/primary')[0].text
    cat_sec = elem.xpath('category/secondary')[0].text
    url_prod = elem.xpath('URL/product')[0].text
    url_img = elem.xpath('URL/productImage')[0].text
    desc_short = elem.xpath('description/short')[0].text
    desc_long = elem.xpath('description/long')[0].text

for event, elem in etree.iterparse(f, events=('end',), tag='product'):
    if len(elem):  # only the outer <product> has child elements
        save_product(elem)
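For a file of 1.5 GB it probably also matters that processed elements are released, otherwise the tree keeps growing while iterparse runs. Below is a rough sketch of that idea, not a definitive implementation. It assumes the products sit under a single wrapper element (called <products> here, which is a guess since the question shows only one record), and iter_products, the dictionary keys and the sample data are all made up for illustration.

```python
from io import BytesIO
from lxml import etree


def iter_products(source):
    """Yield one dict per outer <product>, freeing memory as we go."""
    for event, elem in etree.iterparse(source, events=('end',), tag='product'):
        if not len(elem):
            # Inner URL/product: leave it untouched so that the outer
            # element can still read its text later.
            continue
        yield {
            'id': elem.get('product_id'),
            'url': elem.findtext('URL/product'),
            'image': elem.findtext('URL/productImage'),
        }
        # Drop the element's content once the caller is done with it ...
        elem.clear()
        # ... and remove earlier siblings still referenced by the parent.
        parent = elem.getparent()
        if parent is not None:
            while elem.getprevious() is not None:
                del parent[0]


# Hypothetical two-record sample wrapped in an assumed <products> root.
SAMPLE = b"""<products>
<product product_id="1"><URL><product>http://a</product><productImage>a.png</productImage></URL></product>
<product product_id="2"><URL><product>http://b</product><productImage>b.png</productImage></URL></product>
</products>"""

rows = list(iter_products(BytesIO(SAMPLE)))
print(rows)
```

Note that the inner <product> is skipped rather than cleared. Clearing it would wipe the URL text before the outer element is processed.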

I know this might be quite late, but for anyone else out there, I used the following solution:

   file_contents = xml_file.read()
   xml_obj = etree.fromstring(file_contents)
   context = xml_obj.xpath(tag)

where my tag variable was the path to the product, e.g. //parent/product. You can then use the context container to do something with your elements.
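A self-contained sketch of this approach is shown here. The sample data and the path //URL/product are chosen for illustration only and are not taken from the answer. Keep in mind that etree.fromstring builds the whole tree in memory, so this may not be practical for a file as large as the one in the question.

```python
from lxml import etree

# Hypothetical sample; the real input would be read from the file.
SAMPLE = b"""<products>
  <product product_id="1">
    <URL><product>http://a</product></URL>
  </product>
</products>"""

root = etree.fromstring(SAMPLE)

# An explicit path separates the nested URL/product from the outer
# <product>, so the duplicate tag name is no longer ambiguous.
url_nodes = root.xpath('//URL/product')
url_texts = [node.text for node in url_nodes]
print(url_texts)  # ['http://a']
```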
