XML Parsing with ElementTree and Requests

I am trying to work with the Yahoo Weather API, but I am having trouble parsing the XML that the API returns. I am using Python 3.4. Here's the code I am working with:

weather_url = 'http://weather.yahooapis.com/forecastrss?w=%s&u=%s'
url = weather_url % (zip_code, units)

try:
    rss = parse(requests.get(url, stream=True).raw).getroot()

    conditions = rss.find('channel/item/{%s}condition' % weather_ns)

    return {
        'current_condition': conditions.get('text'),
        'current_temp': conditions.get('temp'),
        'title': rss.findtext('channel/title')
    }
except:
    raise

Here's the stack trace that I am getting:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/jonathan/PycharmProjects/pyweather/pyweather/pyweather.py", line 42, in yahoo_conditions
    rss = parse(requests.get(url, stream=True).raw).getroot()
  File "/usr/lib/python3.4/xml/etree/ElementTree.py", line 1187, in parse
    tree.parse(source, parser)
  File "/usr/lib/python3.4/xml/etree/ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

The xml.etree.ElementTree parse function doesn't like the raw object returned by the requests library. Looking into it a little bit deeper, the raw object resolves to

>>> r = requests.get('http://weather.yahooapis.com/forecastrss?w=2502265', stream=True)
>>> r.raw
<requests.packages.urllib3.response.HTTPResponse object at 0x7f32c24f9e48>

I referenced this solution, but it still leads to the same issue. Why doesn't the approach above work? Is the urllib3 response object not supported by the ElementTree.parse function? I have read all of the docs, but they haven't enlightened me at all.

Edit: After more experimentation, I still haven't found a solution to the problem outlined above. However, I have found a workaround. If you pass the response body to ElementTree's fromstring function, everything works fine.

import requests
import xml.etree.ElementTree as ET


def fetch_xml(url):
    """
    Fetch a url and parse the document's XML.

    :param url: the URL that the XML is located at.
    :return: the root element of the XML.
    :raises:
        :requests.exceptions.RequestException: Requests could not open the URL.
        :xml.etree.ElementTree.ParseError: xml.etree.ElementTree failed to parse the XML document.
    """

    return ET.fromstring(requests.get(url).content)

I guess the downside to this approach is that it uses more memory, since the whole response is loaded before parsing. What do you think? I'd like to get the community's opinion.
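On why parse fails on the raw object: a likely cause is that .raw hands back the bytes exactly as they came over the wire. If the server sent the body gzip-compressed, ElementTree sees compressed data rather than XML, which would match an "invalid token" error at line 1, column 0. The sketch below assumes that is what happened here. fetch_xml_streamed and parse_stream are hypothetical helper names. The first sets decode_content on the raw urllib3 response so the stream is decompressed as it is read. The offline part reproduces the failure with a gzip-compressed sample document, so it runs without network access.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Small stand-in document; the real feed is not fetched here.
XML = b'<?xml version="1.0"?><rss><channel><title>demo</title></channel></rss>'


def parse_stream(fileobj):
    """Parse XML from a file-like object and return the root element."""
    return ET.parse(fileobj).getroot()


def fetch_xml_streamed(url):
    """Stream a URL into ElementTree, decompressing the body on the fly."""
    import requests  # third-party; imported lazily so the demo below needs only the stdlib

    response = requests.get(url, stream=True)
    response.raise_for_status()
    # Without this, .raw yields the body as sent (possibly gzip/deflate bytes).
    response.raw.decode_content = True
    return parse_stream(response.raw)


# Offline reproduction: gzip bytes are not XML, so the parser rejects them.
compressed = gzip.compress(XML)
try:
    parse_stream(io.BytesIO(compressed))
    raw_parse_failed = False
except ET.ParseError:
    raw_parse_failed = True

# Decompressing while reading makes the same bytes parse.
root = parse_stream(gzip.GzipFile(fileobj=io.BytesIO(compressed)))
title = root.findtext('channel/title')
```

As for memory, an RSS feed of this kind is typically only a few kilobytes, so the difference between streaming and calling fromstring on the full content is unlikely to matter in practice.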

Why are you using streaming with requests to download some RSS XML data? Do you want to keep a connection open all the time? Weather hardly changes that quickly, so why not just poll the service every 5 minutes instead?

Below is the complete code for doing a poll and parsing using BeautifulSoup and requests. Short and sweet.

import requests
from bs4 import BeautifulSoup

r = requests.get('http://weather.yahooapis.com/forecastrss?w=%s&u=%s' % (2459115, "c"))
if r.status_code == 200:
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(r.text, 'html.parser')
    print("Current condition: ", soup.find("description").string)
    print("Temperature: ", soup.find('yweather:condition')['temp'])
    print("Title: ", soup.find("title").string)
else:
    r.raise_for_status()

Output:

Current condition:  Yahoo! Weather for New York, NY
Temperature:  28
Title:  Yahoo! Weather - New York, NY

There is a lot more you can do with BeautifulSoup. Look up its excellent documentation.
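If you would rather stay with ElementTree, the same fields can be read with a namespace map instead of formatting the URI into the path by hand. This is only a sketch: the sample document is made up for illustration, read_conditions is a hypothetical helper name, and the namespace URI is an assumption that should be checked against the xmlns:yweather attribute on the feed's root element.

```python
import xml.etree.ElementTree as ET

# Assumed URI for the yweather prefix; verify it against the real feed.
WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

# Hypothetical sample shaped like the feed described in the question.
SAMPLE = """<rss version="2.0" xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0">
  <channel>
    <title>Yahoo! Weather - New York, NY</title>
    <item>
      <yweather:condition text="Fair" temp="28"/>
    </item>
  </channel>
</rss>"""


def read_conditions(root):
    """Pull the current conditions out of a parsed feed root element."""
    namespaces = {'yweather': WEATHER_NS}
    condition = root.find('channel/item/yweather:condition', namespaces)
    if condition is None:
        # find() returns None when the element is missing; fail with a clear message.
        raise ValueError('no yweather:condition element in feed')
    return {
        'current_condition': condition.get('text'),
        'current_temp': condition.get('temp'),
        'title': root.findtext('channel/title'),
    }


result = read_conditions(ET.fromstring(SAMPLE))
```

Checking for None before calling get avoids an AttributeError when the feed comes back without a condition element, for example for an unknown location.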
