Python / lxml: How do I capture a row in an HTML table?

Question

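For my stock screening tool, I have to switch from BeautifulSoup to lxml in my script. After my Python script has downloaded the web pages I need to process, BeautifulSoup parses them correctly, but the process is far too slow. Parsing the balance sheet, income statement, and cash flow statement for just one stock takes BeautifulSoup about 10 seconds, which is unacceptably slow given that my script has more than 5000 stocks to analyze.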

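According to some benchmarks (http://www.crummy.com/2012/1/22/0), lxml is nearly 100 times faster than BeautifulSoup. So lxml should be able to finish in about 10 minutes a job that would take BeautifulSoup 14 hours.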
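As a rough check on those figures, here is a small sketch that uses only the numbers quoted above and assumes the 100x speedup carries over unchanged to these pages (which the benchmark does not guarantee):

```python
# Figures from the question: ~10 s per stock, 5000 stocks, ~100x claimed speedup.
seconds_per_stock = 10
stock_count = 5000
speedup = 100  # assumption: the benchmark ratio applies to these pages as-is

total_seconds = seconds_per_stock * stock_count
bs_hours = total_seconds / 3600.0
lxml_minutes = total_seconds / float(speedup) / 60.0

print("BeautifulSoup: about %.1f hours" % bs_hours)
print("lxml, if the ratio holds: about %.1f minutes" % lxml_minutes)
```

That works out to roughly 13.9 hours versus roughly 8.3 minutes, which matches the "14 hours" and "10 minutes" estimates.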

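How can I use lxml to capture the contents of a row in an HTML table? An example of an HTML page that my script has downloaded and needs to parse is at http://www.smartmoney.com/quote/FAST/?story=financials&opt=YB

The source code that parses this HTML table with BeautifulSoup is: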

    url_local = local_balancesheet(symbol_input)
    url_local = "file://" + url_local
    page = urllib2.urlopen(url_local)
    soup = BeautifulSoup(page)
    soup_line_item = soup.findAll(text=title_input)[0].parent.parent.parent
    list_output = soup_line_item.findAll('td')  # List of elements

If I am looking for cash and short-term investments, title_input = "Cash & Short Term Investments".

How can I do the same thing in lxml?

Answer 1

You can use the lxml parser with BeautifulSoup, so I'm not sure why you are switching.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

    soup = BeautifulSoup(markup, "lxml")
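To make the parser swap concrete, here is a minimal sketch of the question's row lookup done through bs4 with the lxml parser. The `SAMPLE_HTML` table, its numbers, and the `find_row_cells` helper are made up for illustration; the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one downloaded balance-sheet page (made-up values).
SAMPLE_HTML = """
<table>
  <tr><td><div>Fiscal Year Ending in 2011</div></td><td>2011</td><td>2010</td></tr>
  <tr><td><div>Cash &amp; Short Term Investments</div></td><td>1,234</td><td>5,678</td></tr>
</table>
"""


def find_row_cells(markup, title):
    """Return the text of every <td> in the row whose label equals title."""
    soup = BeautifulSoup(markup, "lxml")  # same API, lxml does the parsing
    # Match the text node whose stripped content equals the requested label.
    label = soup.find(string=lambda s: s is not None and s.strip() == title)
    if label is None:
        return []
    row = label.find_parent("tr")  # climb to the enclosing table row
    return [td.get_text(strip=True) for td in row.find_all("td")]


cells = find_row_cells(SAMPLE_HTML, "Cash & Short Term Investments")
print(cells)
```

Using `find_parent("tr")` instead of a fixed chain of `.parent.parent.parent` is a design choice in this sketch, so the lookup does not depend on the exact nesting depth.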

Edit: here is some code. For me, it takes about six seconds.

    import urllib2  # Python 2, as in the question

    from bs4 import BeautifulSoup

    def get_page_data(url):
        f = urllib2.urlopen(url)
        soup = BeautifulSoup(f, 'lxml')
        f.close()
        trs = soup.findAll('tr')
        data = {}
        for tr in trs:
            try:
                if tr.div.text.strip() in ('Cash & Short Term Investments', 'Property, Plant & Equipment - Gross',
                                           'Total Liabilities', 'Preferred Stock (Carrying Value)'):
                    data[tr.div.text] = [int(''.join(e.text.strip().split(','))) for e in tr.findAll('td')]
            except (AttributeError, ValueError):
                # rows without a <div> (e.g. headers) make tr.div None, so .text raises AttributeError
                # non-numeric cells such as 'Fiscal Year Ending in 2011' raise ValueError in int()
                pass
        return data
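
If you do want to drop BeautifulSoup entirely, the same row lookup can be sketched with lxml alone. This is only an illustration under assumptions: `SAMPLE_HTML`, its numbers, and the `row_cells` helper are made up, and the XPath assumes the label sits somewhere inside the row it belongs to.

```python
import lxml.html

# Hypothetical stand-in for one downloaded balance-sheet page (made-up values).
SAMPLE_HTML = """
<table>
  <tr><td><div>Fiscal Year Ending in 2011</div></td><td>2011</td><td>2010</td></tr>
  <tr><td><div>Cash &amp; Short Term Investments</div></td><td>1,234</td><td>5,678</td></tr>
</table>
"""


def row_cells(markup, title):
    """Return the text of every <td> in the first row labelled with title."""
    doc = lxml.html.fromstring(markup)
    # Select rows containing an element whose whitespace-normalized text
    # equals the label; $title is passed as an XPath variable.
    rows = doc.xpath(".//tr[.//*[normalize-space(.) = $title]]", title=title)
    if not rows:
        return []
    return [td.text_content().strip() for td in rows[0].xpath("./td")]


cells = row_cells(SAMPLE_HTML, "Cash & Short Term Investments")
# Skip the label cell and strip thousands separators before converting.
values = [int(c.replace(",", "")) for c in cells[1:]]
print(cells, values)
```

Passing the label as an XPath variable avoids quoting problems with titles that contain apostrophes or ampersands. For a page already saved to disk, `lxml.html.parse(path).getroot()` should give a root element you can query the same way.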
