
Python/lxml: How do I capture a row in an HTML table?

For my stock screening tool, I need to switch from BeautifulSoup to lxml in my script. After my Python script downloads the web pages I need to process, BeautifulSoup parses them correctly, but far too slowly: parsing the balance sheet, income statement, and cash flow statement of a single stock takes BeautifulSoup about 10 seconds, which is unacceptable given that my script has over 5,000 stocks to analyze.

According to some benchmark tests (http://www.crummy.com/2012/1/22/0), lxml is nearly 100 times faster than BeautifulSoup. Thus, lxml should be able to complete in about 10 minutes a job that would take BeautifulSoup 14 hours.

How can I use lxml to capture the contents of a row in an HTML table? An example of an HTML page that my script has downloaded and needs to parse is at http://www.smartmoney.com/quote/FAST/?story=financials&opt=YB

The source code that uses BeautifulSoup to parse this HTML table is:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    url_local = local_balancesheet(symbol_input)  # path to the saved HTML file
    url_local = "file://" + url_local
    page = urllib2.urlopen(url_local)
    soup = BeautifulSoup(page)
    # Find the title text, then walk up: text -> div -> td -> tr
    soup_line_item = soup.findAll(text=title_input)[0].parent.parent.parent
    list_output = soup_line_item.findAll('td')  # list of <td> elements in that row

If I'm looking for cash and short term investments, title_input = "Cash & Short Term Investments".

How can I do the same function in lxml?

You can use the lxml parser with BeautifulSoup, so I don't know why you're switching libraries entirely.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

    soup = BeautifulSoup(markup, "lxml")
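To illustrate, here is a self-contained sketch of that approach against a hypothetical fragment of the balance-sheet markup (the `markup` string below is an assumption based on the structure the question's code implies, not the actual SmartMoney page):

```python
from bs4 import BeautifulSoup  # bs4; BeautifulSoup 3.x has no parser option

# Hypothetical fragment mimicking one row of the balance-sheet table.
markup = ("<table><tr><td><div>Cash &amp; Short Term Investments</div></td>"
          "<td>164,852</td></tr></table>")

# Same BeautifulSoup API as before, but lxml does the parsing underneath.
soup = BeautifulSoup(markup, "lxml")
cell = soup.find(text="Cash & Short Term Investments")
row = cell.parent.parent.parent           # walk up: text -> div -> td -> tr
tds = row.findAll('td')
print([td.get_text() for td in tds])      # -> ['Cash & Short Term Investments', '164,852']
```

The traversal logic is identical to the original BeautifulSoup code; only the underlying parser changes.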

edit: Here's some code to play with. This runs in about six seconds for me.

    import urllib2
    from bs4 import BeautifulSoup

    def get_page_data(url):
        f = urllib2.urlopen(url)
        soup = BeautifulSoup(f, 'lxml')
        f.close()
        trs = soup.findAll('tr')
        data = {}
        for tr in trs:
            try:
                if tr.div.text.strip() in ('Cash & Short Term Investments', 'Property, Plant & Equipment - Gross',
                                           'Total Liabilities', 'Preferred Stock (Carrying Value)'):
                    data[tr.div.text] = [int(''.join(e.text.strip().split(','))) for e in tr.findAll('td')]
            except (AttributeError, ValueError):
                # rows without a <div> (e.g. header rows) raise AttributeError;
                # non-numeric cells like 'Fiscal Year Ending in 2011' raise ValueError
                pass
        return data
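For completeness, since the question asked for pure lxml: the same row lookup can be written with `lxml.html` and an XPath query. This is a sketch against a hypothetical markup fragment (the table structure and numbers are assumptions, not the real SmartMoney page), shown in Python 3 syntax:

```python
import lxml.html

# Hypothetical fragment mimicking two rows of the balance-sheet table.
html = """
<table>
  <tr><td><div>Cash &amp; Short Term Investments</div></td>
      <td>164,852</td><td>162,145</td></tr>
  <tr><td><div>Total Liabilities</div></td>
      <td>292,458</td><td>277,329</td></tr>
</table>
"""

doc = lxml.html.fromstring(html)

def row_values(title):
    # Select the <tr> whose <div> text matches the title; this mirrors
    # soup.findAll(text=title_input)[0].parent.parent.parent.
    # $t is an XPath variable bound via the keyword argument.
    (tr,) = doc.xpath('//tr[td/div[normalize-space()=$t]]', t=title)
    # Skip the title cell, strip thousands separators, convert to int.
    return [int(td.text.replace(',', '')) for td in tr.xpath('td[position()>1]')]

print(row_values('Cash & Short Term Investments'))  # -> [164852, 162145]
```

One XPath expression replaces the parent-chasing, which is typically where the lxml speedup comes from.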
