I want to parse XML on the fly from a large file (1.5 GB) that looks like this:
    <product product_id="x" name="x" sku_number="x">
        <category>
            <primary>x</primary>
            <secondary>y</secondary>
        </category>
        <URL>
            <product>URL__I_WANT_TO_PULLOUT</product>
            <productImage>x</productImage>
        </URL>
        <description>
            <short>x</short>
            <long>x</long>
        </description>
    </product>
I'm using lxml.etree.iterparse like this:
    for event, elem in ET.iterparse(f, events=('end',), tag='product'):
        save_product(elem)
I get all the required values from the XML nodes. The only node I can't pull out is URL/product (it's just empty). I think this is caused by the inner tag having the same name as the outer one. Is there any way to parse XML on the fly besides iterparse?
If I run etree.iterparse on your sample, it finds the 'product' tag twice: there is one outer and one inner <product>. The outer tag has child elements and no meaningful text of its own. So you need to skip the outer 'product' tags and work only with those that have no child elements, for example:
    for event, elem in etree.iterparse(f, events=('end',), tag='product'):
        if not len(elem):
            save_product(elem)
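To see why the filter is needed, here is a minimal sketch that records every 'product' end event on an inline copy of the sample. The helper name product_events and the in-memory BytesIO source are assumptions made for illustration; a real run would pass the open file instead.

```python
import io
from lxml import etree

# Inline copy of the sample from the question, so the sketch is self-contained.
SAMPLE = b"""<product product_id="x" name="x" sku_number="x">
<category><primary>x</primary><secondary>y</secondary></category>
<URL><product>URL__I_WANT_TO_PULLOUT</product><productImage>x</productImage></URL>
<description><short>x</short><long>x</long></description>
</product>"""


def product_events(source):
    """Return (child_count, stripped_text) for every <product> end event."""
    seen = []
    for _event, elem in etree.iterparse(source, events=("end",), tag="product"):
        seen.append((len(elem), (elem.text or "").strip()))
    return seen


# The inner leaf closes first, so it is reported before the outer element.
events = product_events(io.BytesIO(SAMPLE))

# Keeping only elements without children leaves the URL text.
leaf_texts = [text for child_count, text in events if child_count == 0]
```

The outer element shows up with three children and only whitespace as text, which is why calling save_product on it directly appears to give an empty URL.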
If you need to handle all the elements of the top-level product tag, you can skip the inner product elements in the main loop and then read all the child elements by path, for example (see also the question "python's lxml and iterparse method"):
    from lxml import etree

    def save_product(elem):
        # Read each child of the outer <product> by its relative path.
        cat_prim = elem.xpath('category/primary')[0].text
        cat_sec = elem.xpath('category/secondary')[0].text
        url_prod = elem.xpath('URL/product')[0].text
        url_img = elem.xpath('URL/productImage')[0].text
        desc_short = elem.xpath('description/short')[0].text
        desc_long = elem.xpath('description/long')[0].text

    for event, elem in etree.iterparse(f, events=('end',), tag='product'):
        # Only the outer <product> has children; the inner URL/product is a leaf.
        if len(elem):
            save_product(elem)
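For a file of this size it may also help to free each element once it has been handled, otherwise the tree keeps growing while iterparse runs. The following is a sketch under assumptions, not the answerer's code: the catalog root element, the second product record, and the helper names extract_product and iter_products are all hypothetical.

```python
import io
from lxml import etree

# Hypothetical document: a <catalog> root wrapping two product records.
SAMPLE = b"""<catalog>
<product product_id="1" name="a" sku_number="s1">
  <category><primary>p1</primary><secondary>s1</secondary></category>
  <URL><product>http://example.com/a</product><productImage>a.png</productImage></URL>
  <description><short>short a</short><long>long a</long></description>
</product>
<product product_id="2" name="b" sku_number="s2">
  <category><primary>p2</primary><secondary>s2</secondary></category>
  <URL><product>http://example.com/b</product><productImage>b.png</productImage></URL>
  <description><short>short b</short><long>long b</long></description>
</product>
</catalog>"""


def extract_product(elem):
    """Copy the wanted values out of an outer <product> element."""
    return {
        "product_id": elem.get("product_id"),
        "category_primary": elem.findtext("category/primary"),
        "url_product": elem.findtext("URL/product"),
        "url_image": elem.findtext("URL/productImage"),
        "description_short": elem.findtext("description/short"),
    }


def iter_products(source):
    """Yield one dict per outer <product>, releasing memory as it goes."""
    for _event, elem in etree.iterparse(source, events=("end",), tag="product"):
        if not len(elem):
            continue  # inner URL/product leaf: leave it for its parent
        yield extract_product(elem)
        # The values are already copied, so the element can be emptied.
        elem.clear()
        # Drop earlier siblings that the root element still references.
        while elem.getprevious() is not None:
            del elem.getparent()[0]


products = list(iter_products(io.BytesIO(SAMPLE)))
```

Skipping the leaf before clearing matters here: the inner element must stay intact until its outer product has been read.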
I know this might be quite late, but for anyone out there, I used the following solution:
    file_contents = xml_file.read()
    xml_obj = etree.fromstring(file_contents)
    context = xml_obj.xpath(tag)
where my tag variable was the path to the product, e.g. //parent/product. You can then use the context container to do something with your elements.
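A self-contained sketch of this approach follows. The parent root element, the default path, and the function name product_urls are assumptions for illustration. Note that etree.fromstring builds the whole tree in memory, so for a 1.5 GB file the iterparse approach is probably the safer choice.

```python
from lxml import etree

# Hypothetical document with a <parent> root, matching the //parent/product path.
SAMPLE = b"""<parent>
<product product_id="1" name="a" sku_number="s1">
  <URL><product>http://example.com/a</product><productImage>a.png</productImage></URL>
</product>
<product product_id="2" name="b" sku_number="s2">
  <URL><product>http://example.com/b</product><productImage>b.png</productImage></URL>
</product>
</parent>"""


def product_urls(xml_bytes, path="//parent/product/URL/product"):
    """Parse the whole document at once and return the text of each match."""
    root = etree.fromstring(xml_bytes)
    return [node.text for node in root.xpath(path)]


urls = product_urls(SAMPLE)
```

Because the path spells out URL/product explicitly, the outer and inner product elements are not confused with each other.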