Scraping XML data with BS4 “lxml”

Question

I'm trying to solve a problem very similar to this one:

Scraping XML element attributes with beautifulsoup

I have the following code:

from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.usda.gov/oce/commodity/wasde/latest.xml')
data = r.text
soup = BeautifulSoup(data, "lxml")
for ce in soup.find_all("Cell"):
    print(ce["cell_value1"])

The code runs without error but does not print any values to the terminal.

I want to extract the "cell_value1" data noted above for the whole page so I have something like this:

2468.58
3061.58
376.64
and so on...

The format of my XML file is the same as the sample in the solution from the question noted above, and I identified the tag and attribute I want to scrape. Why are the values not printing to the terminal?

Answer 1

The problem is that you're parsing this file in HTML mode, which means the tags end up named 'cell' instead of 'Cell'. So, you could just search with 'cell', but the right answer is to parse in XML mode.

To do this, just use 'xml' as your parser instead of 'lxml'. (It's a little non-obvious that 'lxml' means "lxml in HTML mode" and 'xml' means "lxml in XML mode", but it is documented.)

This is explained in the "Other parser problems" section of the Beautiful Soup documentation:

Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and attributes, you'll need to parse the document as XML.
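
To see the difference without hitting the network, here is a minimal sketch that parses the same string with both parsers. The sample markup is made up for illustration (it only imitates the Cell elements from the question), and it assumes both bs4 and lxml are installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample data, shaped like the Cell elements in the question.
sample = (
    '<Report>'
    '<Cell cell_value1="2468.58"></Cell>'
    '<Cell cell_value1="3061.58"></Cell>'
    '</Report>'
)

html_soup = BeautifulSoup(sample, "lxml")  # lxml in HTML mode: tag names are lowercased
xml_soup = BeautifulSoup(sample, "xml")    # lxml in XML mode: case is preserved

html_upper = html_soup.find_all("Cell")  # mixed-case name no longer exists in HTML mode
html_lower = html_soup.find_all("cell")  # the lowercased name is what the HTML tree contains
xml_upper = xml_soup.find_all("Cell")    # original name still matches in XML mode

print(len(html_upper), len(html_lower), len(xml_upper))
```

In HTML mode the search for "Cell" comes back empty, which is why the original loop ran without error but printed nothing.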

Your code will still fail because of a second problem: some of the Cell nodes are empty and do not have a cell_value1 attribute, so ce["cell_value1"] raises a KeyError when you try to print it unconditionally.

So, what you want is something like this:

soup = BeautifulSoup(data, "xml")
for ce in soup.find_all("Cell"):
    try:
        print(ce["cell_value1"])
    except KeyError:
        pass
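
If you'd rather not use try/except, two alternatives are sketched below: Tag.get() returns None for a missing attribute, and passing cell_value1=True to find_all() matches only tags that have that attribute at all. The sample string is again made up for illustration, with an empty Cell in the middle:

```python
from bs4 import BeautifulSoup

# Hypothetical sample data: the second Cell has no cell_value1 attribute.
sample = (
    '<Report>'
    '<Cell cell_value1="2468.58"/>'
    '<Cell/>'
    '<Cell cell_value1="376.64"/>'
    '</Report>'
)
soup = BeautifulSoup(sample, "xml")

# Option 1: .get() returns None instead of raising KeyError; drop the Nones.
values_get = [ce.get("cell_value1") for ce in soup.find_all("Cell")]
values_get = [v for v in values_get if v is not None]

# Option 2: ask find_all() for only the Cell tags that carry the attribute.
values_filtered = [ce["cell_value1"] for ce in soup.find_all("Cell", cell_value1=True)]

for value in values_filtered:
    print(value)
```

Both approaches should give the same list here; the second keeps the loop body free of any missing-attribute handling.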

solution1
4 2018-04-03 21:47:36
