
Python/lxml: How do I capture a row in an HTML table?

For my stock screening tool, I must switch from BeautifulSoup to lxml in my script. After my Python script downloads the web pages I need to process, BeautifulSoup can parse them properly, but the process is too slow. Parsing the balance sheet, income statement, and cash flow statement of just one stock takes BeautifulSoup about 10 seconds, which is unacceptably slow given that my script has over 5,000 stocks to analyze.

According to some benchmark tests (http://www.crummy.com/2012/1/22/0), lxml is nearly 100 times faster than BeautifulSoup. Thus, lxml should be able to complete within about 10 minutes a job that would take BeautifulSoup roughly 14 hours (5,000+ stocks at ~10 seconds each).

How can I use lxml to capture the contents of a row in an HTML table? An example of an HTML page that my script has downloaded and needs to parse is at http://www.smartmoney.com/quote/FAST/?story=financials&opt=YB

The source code that uses BeautifulSoup to parse this HTML table is:

    import urllib2
    from bs4 import BeautifulSoup

    url_local = local_balancesheet(symbol_input)  # path to the downloaded page
    url_local = "file://" + url_local
    page = urllib2.urlopen(url_local)
    soup = BeautifulSoup(page)
    # Find the text node matching the row label, then walk up to its <tr>
    soup_line_item = soup.findAll(text=title_input)[0].parent.parent.parent
    list_output = soup_line_item.findAll('td')  # list of <td> elements in that row

If I'm looking for cash and short term investments, title_input = "Cash & Short Term Investments".

How can I do the same thing in lxml?
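
For reference, here is a minimal sketch of how the same lookup might be done with lxml and XPath. The exact element nesting (a div label somewhere inside the target tr) is an assumption based on the .parent.parent.parent navigation in the BeautifulSoup code above, so the XPath may need adjusting for the real page structure:

    import lxml.html

    def get_line_item(url_local, title_input):
        # lxml.html.parse accepts a filename or URL for the downloaded page
        tree = lxml.html.parse(url_local)
        # Select the <tr> whose label <div> text matches title_input and
        # return the text of its <td> cells (XPath variables are passed
        # as keyword arguments to xpath())
        return tree.xpath(
            '//tr[.//div[normalize-space(text())=$title]]//td//text()',
            title=title_input)

The returned strings can then be stripped of commas and converted to integers, much like the answer below does.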

You can use the lxml parser with BeautifulSoup, so I don't know why you're doing this.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

soup = BeautifulSoup(markup, "lxml")

edit: Here's some code to play with. This runs in about six seconds for me.

import urllib2
from bs4 import BeautifulSoup

def get_page_data(url):
    f = urllib2.urlopen(url)
    soup = BeautifulSoup(f, 'lxml')  # let BeautifulSoup use the lxml parser
    f.close()
    trs = soup.findAll('tr')
    data = {}
    for tr in trs:
        try:
            if tr.div.text.strip() in ('Cash & Short Term Investments', 'Property, Plant & Equipment - Gross',
                               'Total Liabilities', 'Preferred Stock (Carrying Value)'):
                # convert each cell like "1,234" to the integer 1234
                data[tr.div.text] = [int(''.join(e.text.strip().split(','))) for e in tr.findAll('td')]
        except (AttributeError, ValueError):
            # header rows have no <div>, so tr.div is None and raises AttributeError;
            # non-numeric cells such as 'Fiscal Year Ending in 2011' raise ValueError
            pass
    return data
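
As a rough usage sketch, using the page from the question (the row labels are the ones hard-coded in get_page_data above):

if __name__ == '__main__':
    url = 'http://www.smartmoney.com/quote/FAST/?story=financials&opt=YB'
    data = get_page_data(url)
    for label, values in data.items():
        print label, values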
