简体   繁体   中英

Extract elements from XML file using Python

The link below gives us the list of ingredients in recipelist. I would like to extract the names of the ingredient and save it to another file using python. http://stream.massey.ac.nz/file.php/6087/Eva_Material/Tutorials/recipebook.xml

So far I have tried using the following code, but it gives me the complete recipe not the names of the ingredients:

from xml.sax.handler import ContentHandler
import xml.sax
import sys
def recipeBook(): 
    path = "C:\Users\user\Desktop"
    basename = "recipebook.xml"
    filename = path+"\\"+basename
    file=open(filename,"rt")
    # find contents 
    contents = file.read()

    class textHandler(ContentHandler):
      def characters(self, ch):
      sys.stdout.write(ch.encode("Latin-1"))
    parser = xml.sax.make_parser()
    handler = textHandler( )
    parser.setContentHandler(handler)
    parser.parse("C:\Users\user\Desktop\\recipebook.xml")



  file.close()

How do I extract the name of each ingredient and save them to another file?

@Neha

I guess you have solved your request by now, here is a little piece I put together using the tutorial at http://lxml.de/tutorial.html . The XML file is saved in 'rough_data.xml'

import xml.etree.cElementTree as etree

xmlDoc = open('rough_data.xml', 'r')
xmlDocData = xmlDoc.read()
xmlDocTree = etree.XML(xmlDocData)

for ingredient in xmlDocTree.iter('ingredient'):
    print ingredient[0].text

To all experienced Python programmers reading this, kindly improve this "newbie" code.

Note: The lxml package looks very good, it definitely worth using. Thanks

Please place the relevant XML text in order to receive a proper answer. Also please consider using lxml for anything xml specific (including html) .

try this :

from lxml import etree

tree=etree.parse("your xml here")
all_recipes=tree.xpath('./recipebook/recipe')
recipe_names=[x.xpath('recipe_name/text()') for x in all_recipes]
ingredients=[x.getparent().xpath('../ingredient_list/ingredients') for x in recipe_names]
ingredient_names=[x.xpath('ingredient_name/text()') for x in ingredients]

Here is the beginning only, but i think you get the idea from here -> get the parent from each ingredient_name and the ingredients/quantities from there and so on. You can't really do any other kind of search i think due to the structured nature of the document.

you can read more on [www.lxml.de]

Some time ago I made a series of screencasts that explain how to collect data from sites. The code is in Python and there are 2 videos about XML parsing with the lxml library. All the videos are posted here: http://railean.net/index.php/2012/01/27/fortune-cowsay-python-video-tutorial

The ones you want are:

  • XPath experiments and query examples
  • Python and LXML, examples of XPath queries with Python
  • Automate page retrieving via HTTP and HTML parsing with lxml

You'll learn how to write and test XPath queries, as well as how to run such queries in Python. The examples are straightforward, I hope you'll find them helpful.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM