
Parsing XML with Python and ElementTree

I am doing a class project where I have to save a list of links to a text file.

I was given the XML and am trying to iterate through all the URLs, but I am having trouble.

I have tried using ElementTree but cannot iterate through the elements. I read many other questions and tried their suggestions with no success. Please help.

The structure looks like this:

<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
<url>....

I suggest using lxml to parse the XML file efficiently.

from lxml import etree

Your XML sample is not well-formed, so I fixed it like this:

content = """\
<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
</urlset>"""

To parse a file, you can use etree.parse(). But since this sample is a string, I use etree.XML():
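The same file-versus-string split exists in the standard library's xml.etree.ElementTree, which may be handy if lxml is not installed. A minimal sketch, assuming a hypothetical file named sitemap.xml that is written to a temporary directory just for the demonstration:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

SAMPLE = """\
<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
</urlset>"""

# From a string: fromstring() returns the root element directly.
root_from_string = ET.fromstring(SAMPLE)

# From a file: parse() returns an ElementTree, so call getroot() on it.
# "sitemap.xml" is a hypothetical file name used only for this sketch.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sitemap.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE)
    root_from_file = ET.parse(path).getroot()

# Both routes end up at the same root element.
print(root_from_string.tag == root_from_file.tag)
```

Either way you end up with a root element to search, so the rest of the approach does not depend on where the XML came from.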

tree = etree.XML(content)

The natural way to search for elements in an XML tree is to use XPath. For instance, you can do this:

loc_list = tree.xpath("//url/loc")

But you'll get nothing, because the list is empty and the loop body never runs:

for loc in loc_list:
    print(loc.text)
# (no output: loc_list is empty)

The reason, and it is probably your problem, is that <urlset> uses a default namespace: "http://www.crawlingcourse.com/sitemap/1.3".
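One way to see the effect of the default namespace is to print the tag names as the parser stores them. This is a small sketch using the standard library's xml.etree.ElementTree rather than lxml, on the assumption that it is easier to try out; the variable names are only illustrative:

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
</urlset>"""

root = ET.fromstring(SAMPLE)

# Every tag is stored as "{namespace-uri}local-name", not as the bare name.
tags = [elem.tag for elem in root.iter()]
for tag in tags:
    print(tag)

# A search for the bare name "loc" therefore matches nothing.
unqualified = root.findall(".//loc")
print(len(unqualified))
```

The bare name loc never appears among the stored tags, which is why the unqualified search comes back empty.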

To make it work, you need to pass this namespace to the xpath() method. Let's give the namespace a prefix, "s":

NS = {'s': "http://www.crawlingcourse.com/sitemap/1.3"}

Then, use the s prefix in your XPath expression like this:

loc_list = tree.xpath("//s:url/s:loc", namespaces=NS)

for loc in loc_list:
    print(loc.text)
#     http://www.crawlingcourse.com/item-3911512

Because your XML is indented, you need to strip the surrounding whitespace:

for loc in loc_list:
    url = loc.text.strip()
    print(url)
# http://www.crawlingcourse.com/item-3911512
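Putting the pieces together for the original goal (saving the links to a text file), here is a rough sketch. It uses the standard library's xml.etree.ElementTree, whose findall() also accepts a prefix-to-URI mapping, instead of lxml's xpath(). The helper names extract_urls and save_urls are hypothetical, made up for this example:

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping; the prefix "s" is arbitrary.
NS = {"s": "http://www.crawlingcourse.com/sitemap/1.3"}


def extract_urls(xml_text):
    """Return the stripped text of every <loc> inside a <url> element."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall(".//s:url/s:loc", NS):
        if loc.text:  # skip empty <loc/> elements
            urls.append(loc.text.strip())
    return urls


def save_urls(urls, path):
    """Write one URL per line to a text file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
```

For example, save_urls(extract_urls(content), "links.txt") would write the sample's single URL to a file named links.txt; that file name is only an assumption.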

Well, the issue really is the namespace.

Here's working code, with the namespace declaration removed:

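# cElementTree was removed in Python 3.9; the plain ElementTree module replaces it.
from xml.etree.ElementTree import ElementTree, fromstring
xml_string = '<?xml version="1.0"?><urlset><url><loc>http://www.crawlingcourse.com/item-3911512</loc></url></urlset>'
tree = ElementTree(fromstring(xml_string))
# Collect the text of every <loc> element; print() is a function in Python 3.
print([elem.text for elem in tree.iter(tag='loc')])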

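Now, if you add the namespace back with <urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">, the tags are going to be different: ElementTree qualifies each one with the namespace URI, so loc becomes {http://www.crawlingcourse.com/sitemap/1.3}loc. From http://www.w3schools.com/xml/xml_namespaces.asp: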

XML Namespaces - The xmlns Attribute. When using prefixes in XML, a namespace for the prefix must be defined. The namespace can be defined by an xmlns attribute in the start tag of an element. The namespace declaration has the following syntax. xmlns:prefix="URI".

Threw me off too!
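To make the difference concrete, here is a small sketch of the same iteration with the namespace declaration in place. It assumes the single-entry sample from the question; NAMESPACE and the list names are just illustrative:

```python
from xml.etree.ElementTree import fromstring

NAMESPACE = "http://www.crawlingcourse.com/sitemap/1.3"
xml_string = (
    '<urlset xmlns="' + NAMESPACE + '">'
    "<url><loc>http://www.crawlingcourse.com/item-3911512</loc></url>"
    "</urlset>"
)
root = fromstring(xml_string)

# With a default namespace, the bare tag name no longer matches anything.
plain = [elem.text for elem in root.iter("loc")]

# The tag has to be written in the "{uri}tag" form instead.
qualified = [elem.text for elem in root.iter("{" + NAMESPACE + "}loc")]

print(plain)
print(qualified)
```

The first list comes back empty and the second contains the URL, which matches the behaviour described above.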
