I have the following sitemap that I am trying to parse:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.example.com/examplea</loc>
<priority>0.5</priority>
<lastmod>2019-03-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://www.example.com/exampleb</loc>
<priority>0.5</priority>
<lastmod>2019-03-14</lastmod>
<changefreq>daily</changefreq>
</url>
</urlset>
What's the fastest way to obtain the URLs inside the loc tags using Python?
I tried using ElementTree, but I think it didn't work because of the namespace.
I need to get "https://www.example.com/examplea" and "https://www.example.com/exampleb".
import re
# Named sitemap rather than str, so the built-in str type is not shadowed.
sitemap = """
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.example.com/examplea</loc>
<priority>0.5</priority>
<lastmod>2019-03-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://www.example.com/exampleb</loc>
<priority>0.5</priority>
<lastmod>2019-03-14</lastmod>
<changefreq>daily</changefreq>
</url>
</urlset>
"""
# Non-greedy capture of everything between <loc> and </loc>.
urls = re.findall(r"<loc>(.*?)</loc>", sitemap)
You could consider using a regular expression.
For your example, the following code does the job:
import re
string = '''
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.example.com/examplea</loc>
<priority>0.5</priority>
<lastmod>2019-03-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://www.example.com/exampleb</loc>
<priority>0.5</priority>
<lastmod>2019-03-14</lastmod>
<changefreq>daily</changefreq>
</url>
</urlset>
'''
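# Raw string so \s is not treated as a string escape; A-Z (not A-z) so the
# scheme class matches letters only.
pattern = r'(?<=<loc>)[a-zA-Z]+://[^\s]*(?=</loc>)'
print(re.findall(pattern, string))
The result is ['https://www.example.com/examplea', 'https://www.example.com/exampleb']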
As the other answers said, you can use a regex. If you are not comfortable with regular expressions, you can also use the xmltodict module, which converts the XML into a dictionary from which you can easily pull out whatever data you need.
Using an XML parser but bypassing the namespace:
from io import StringIO  # Python 3; the StringIO module only exists in Python 2
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.example.com/examplea</loc>
<priority>0.5</priority>
<lastmod>2019-03-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://www.example.com/exampleb</loc>
<priority>0.5</priority>
<lastmod>2019-03-14</lastmod>
<changefreq>daily</changefreq>
</url>
</urlset>'''
it = ET.iterparse(StringIO(xml))
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
    # Iterate over a copy of the keys, since the dict is modified in the loop.
    for at in list(el.attrib.keys()):  # strip namespaces of attributes too
        if '}' in at:
            newat = at.split('}', 1)[1]
            el.attrib[newat] = el.attrib[at]
            del el.attrib[at]
root = it.root  # available once the iterator has been fully consumed
urls = [u.text for u in root.findall('.//loc')]
print(urls)
Output
['https://www.example.com/examplea', 'https://www.example.com/exampleb']
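If you would rather keep the namespace than strip it, ElementTree's findall also accepts a prefix-to-URI mapping. Below is a minimal sketch of that alternative, assuming the sitemap uses the standard namespace shown in the question. The names SITEMAP_NS and extract_locs and the "sm" prefix are just illustrative choices, not part of any library.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute of <urlset> in the question.
# The "sm" prefix is arbitrary; it only has to match the prefix used in the path.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text):
    """Return the text of every <url>/<loc> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # root is <urlset>; the path selects its <url> children, then their <loc>.
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


sitemap = '''<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.example.com/examplea</loc>
<priority>0.5</priority>
<lastmod>2019-03-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://www.example.com/exampleb</loc>
<priority>0.5</priority>
<lastmod>2019-03-14</lastmod>
<changefreq>daily</changefreq>
</url>
</urlset>'''

print(extract_locs(sitemap))
```

This leaves the parsed tree untouched, which may matter if you need the namespaced tags later. The trade-off is that the namespace URI is hard-coded.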