Python / lxml: How do I capture a row in an HTML table?

Question

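For my stock screening tool, I have to switch from BeautifulSoup to lxml in my script. After my Python script downloads the web pages I need to process, BeautifulSoup parses them correctly, but the process is far too slow. Analyzing the balance sheet, income statement, and cash flow statement of a single stock takes BeautifulSoup about 10 seconds, which is unacceptable given that my script has to analyze more than 5000 stocks.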

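According to some benchmarks (http://www.crummy.com/2012/1/22/0), lxml is nearly 100 times faster than BeautifulSoup. lxml should therefore be able to finish in about 10 minutes a job that would take BeautifulSoup 14 hours.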
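
As a rough check of those figures, the arithmetic can be worked through using only the numbers stated above (about 10 seconds per stock, 5000 stocks, and a speedup of roughly 100x). The constant names are illustrative only, and the 100x factor is the benchmark's claim, not something measured here.

```python
# Figures taken from the question text; the names are illustrative.
SECONDS_PER_STOCK = 10.0   # BeautifulSoup time for one stock's three statements
STOCK_COUNT = 5000         # number of stocks the script must analyze
SPEEDUP = 100.0            # claimed lxml speedup over BeautifulSoup

total_seconds = SECONDS_PER_STOCK * STOCK_COUNT

# BeautifulSoup: 50,000 seconds expressed in hours
bs_hours = total_seconds / 3600.0

# lxml, if the claimed speedup held: 500 seconds expressed in minutes
lxml_minutes = total_seconds / SPEEDUP / 60.0

print("BeautifulSoup: %.1f hours" % bs_hours)
print("lxml (estimated): %.1f minutes" % lxml_minutes)
```

So the "14 hours" and "10 minutes" in the question are both rounded up slightly from these estimates.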

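How can I use lxml to capture the contents of a row in an HTML table? An example of the HTML pages that my script has downloaded and needs to parse is at http://www.smartmoney.com/quote/FAST/?story=financials&opt=YB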

The source code that parses this HTML table with BeautifulSoup is:

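    import urllib2  # Python 2; the BeautifulSoup import depends on the installed version

    url_local = local_balancesheet(symbol_input)  # local path of the downloaded page
    url_local = "file://" + url_local
    page = urllib2.urlopen(url_local)
    soup = BeautifulSoup(page)
    # Find the text node matching the row title, then climb three levels up
    soup_line_item = soup.findAll(text=title_input)[0].parent.parent.parent
    list_output = soup_line_item.findAll('td')  # list of <td> elements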

If I am looking for Cash & Short Term Investments, then title_input = "Cash & Short Term Investments".

How do I do the same thing in lxml?

Answer 1

You can use the lxml parser together with BeautifulSoup, so I am not sure why you need to switch.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

    soup = BeautifulSoup(markup, "lxml")

Edit: here is some code. For me, it takes about six seconds.

    import urllib2  # Python 2 module; Python 3 moved urlopen to urllib.request

    from bs4 import BeautifulSoup


    def get_page_data(url):
        f = urllib2.urlopen(url)
        soup = BeautifulSoup(f, 'lxml')  # BeautifulSoup API, lxml as the parser
        f.close()
        trs = soup.findAll('tr')
        data = {}
        for tr in trs:
            try:
                label = tr.div.text.strip()
                if label in ('Cash & Short Term Investments', 'Property, Plant & Equipment - Gross',
                             'Total Liabilities', 'Preferred Stock (Carrying Value)'):
                    # Drop thousands separators before converting each cell to int
                    data[label] = [int(e.text.strip().replace(',', '')) for e in tr.findAll('td')]
            except (AttributeError, ValueError):
                # rows without a <div> (such as headers) raise AttributeError
                # 'Fiscal Year Ending in 2011' raises ValueError
                pass
        return data
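
If you would rather drop BeautifulSoup entirely, a minimal lxml-only sketch might look like the following. It mirrors the question's approach: find the text node that matches the row title, walk up to the nearest enclosing tr, and collect its td cells. The function name get_row_cells and the SAMPLE_HTML table (including its numbers) are made up for illustration. The markup of the real page is not shown in this thread, so the XPath may need adjusting.

```python
import lxml.html

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Fiscal Year Ending in</th><td>2011</td><td>2010</td></tr>
  <tr>
    <th><div> Cash &amp; Short Term Investments </div></th>
    <td>145,021</td><td>169,734</td>
  </tr>
</table>
</body></html>
"""


def get_row_cells(html_text, title):
    """Return the text of each <td> in the row whose label equals `title`."""
    doc = lxml.html.fromstring(html_text)
    # $title is an XPath variable, supplied as a keyword argument, so the
    # label never has to be quoted or escaped inside the expression.
    rows = doc.xpath(
        '//text()[normalize-space(.) = $title]/ancestor::tr[1]',
        title=title,
    )
    if not rows:
        return []
    # Only direct <td> children of the matched row are collected.
    return [td.text_content().strip() for td in rows[0].findall('td')]


cells = get_row_cells(SAMPLE_HTML, 'Cash & Short Term Investments')
print(cells)
```

To parse a saved file instead of a string, lxml.html.parse(path).getroot() returns the same kind of element, and the xpath call is unchanged.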

Solution 1
1 2012-11-28 23:07:03
