Extracting text from a table with python and lxml

I recently saw that another user had asked a question about extracting information from a web table ("Extracting information from a webpage with python"). The answer from ekhumoro works great on the page that the other user asked about. See below.

from urllib2 import urlopen
from lxml import etree

url = 'http://www.uscho.com/standings/division-i-men/2011-2012/'

tree = etree.HTML(urlopen(url).read())

for section in tree.xpath('//section[starts-with(@id, "section_")]'):
    print section.xpath('h3[1]/text()')[0]
    for row in section.xpath('table/tbody/tr'):
        cols = row.xpath('td//text()')
        print '  ', cols[0].ljust(25), ' '.join(cols[1:])
    print

My problem is using this code as a guide to parse this page: http://www.uscho.com/rankings/d-i-mens-poll/. With the following changes, I can only get the h1 and h3 to print.

Input

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())

for section in tree.xpath('//section[starts-with(@id, "rankings")]'):
    print section.xpath('h1[1]/text()')[0]
    print section.xpath('h3[1]/text()')[0]
    for row in section.xpath('table/tbody/tr'):
        cols = row.xpath('td/b/text()')
        print '  ', cols[0].ljust(25), ' '.join(cols[1:])
    print

Output

USCHO.com Division I Men's Poll
December 12, 2011

The structure of the table seems to be the same so I'm at a loss as to why I can't use similar code. I'm just a mechanical engineer in way over my head. Any help is appreciated.

lxml is great, but if you're not familiar with XPath, I recommend BeautifulSoup:

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
soup = BeautifulSoup(urlopen(url).read())

section = soup.find('section', id='rankings')
h1 = section.find('h1')
print h1.text
h3 = section.find('h3')
print h3.text
print

rows = section.find('table').findAll('tr')[1:-1]
for row in rows:
    columns = [data.text for data in row.findAll('td')[1:]]
    print '{0:20} {1:4} {2:>6} {3:>4}'.format(*columns)

The output for this script is:

USCHO.com Division I Men's Poll
December 12, 2011

Minnesota-Duluth     (49) 12-3-3  999
Minnesota                 14-5-1  901
Boston College            12-6-0  875
Ohio State           ( 1) 13-4-1  848
Merrimack                 10-2-2  844
Notre Dame                11-6-3  667
Colorado College           9-5-0  650
Western Michigan           9-4-5  647
Boston University         10-5-1  581
Ferris State              11-6-1  521
Union                      8-3-5  510
Colgate                   11-4-2  495
Cornell                    7-3-1  347
Denver                     7-6-3  329
Michigan State            10-6-2  306
Lake Superior             11-7-2  258
Massachusetts-Lowell      10-5-0  251
North Dakota               9-8-1   88
Yale                       6-5-1   69
Michigan                   9-8-3   62

The structure of the table is slightly different: the rows sit directly under the table element, so a path that goes through tbody matches nothing, and some columns have blank entries.

Possible lxml solution:

from urllib2 import urlopen
from lxml import etree

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())

for section in tree.xpath('//section[@id="rankings"]'):
    print section.xpath('h1[1]/text()')[0],
    print section.xpath('h3[1]/text()')[0]
    print
    for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
        print '%-3s %-20s %10s %10s %10s %10s' % tuple(
            ''.join(col.xpath('.//text()')) for col in row.xpath('td'))
    print

Output:

USCHO.com Division I Men's Poll December 12, 2011

1   Minnesota-Duluth           (49)     12-3-3        999          1
2   Minnesota                           14-5-1        901          2
3   Boston College                      12-6-0        875          3
4   Ohio State                 ( 1)     13-4-1        848          4
5   Merrimack                           10-2-2        844          5
6   Notre Dame                          11-6-3        667          7
7   Colorado College                     9-5-0        650          6
8   Western Michigan                     9-4-5        647          8
9   Boston University                   10-5-1        581         11
10  Ferris State                        11-6-1        521          9
11  Union                                8-3-5        510         10
12  Colgate                             11-4-2        495         12
13  Cornell                              7-3-1        347         16
14  Denver                               7-6-3        329         13
15  Michigan State                      10-6-2        306         14
16  Lake Superior                       11-7-2        258         15
17  Massachusetts-Lowell                10-5-0        251         18
18  North Dakota                         9-8-1         88         19
19  Yale                                 6-5-1         69         17
20  Michigan                             9-8-3         62         NR
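
The blank-entry point can be seen on a small inline table. This is a minimal Python 3 sketch, not the code from the answer above: the `SAMPLE` markup is a made-up two-row stand-in for the poll table, and only `etree.HTML` and `xpath` from lxml are used. It compares collecting all text nodes of a row at once with joining the text of each cell separately.

```python
from lxml import etree

# Hypothetical markup imitating the poll table: the second row has an
# empty "first-place votes" cell.
SAMPLE = """
<table>
  <tr class="odd"><td>1</td><td><a href="#">Minnesota-Duluth</a></td><td>(49)</td><td>12-3-3</td></tr>
  <tr class="even"><td>2</td><td><a href="#">Minnesota</a></td><td></td><td>14-5-1</td></tr>
</table>
"""

tree = etree.HTML(SAMPLE)
rows = tree.xpath('//table/tr')

# One flat list of text nodes per row: an empty cell contributes nothing,
# so every value after it moves one position to the left.
flat = [row.xpath('td//text()') for row in rows]

# One string per cell: an empty cell becomes '', so positions stay aligned.
per_cell = [
    [''.join(td.xpath('.//text()')) for td in row.xpath('td')]
    for row in rows
]

print(flat[1])
print(per_cell[1])
```

Joining per cell keeps the number of values equal to the number of columns, which is what a fixed format string with one placeholder per column relies on.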

Though this question is old, it still comes up on the web.

I'd like to offer another (more straightforward and up-to-date) option. It adds more dependencies, though: pandas and tabulate (which the to_markdown method depends on).

Unfortunately, I think the page behind the URL used in this question has changed quite a lot since then (the table is now generated by JavaScript and is no longer in the page source). So I'll use this URL instead for practical purposes.

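from io import BytesIO

from lxml import etree, html
import pandas as pd
import requests

url = 'https://www.w3schools.com/html/html_tables.asp'
r = requests.get(url)

# If you want to get a specific table, proceed as follows:

tree_html = html.fromstring(r.content)
first_table = tree_html.xpath(".//table")[0]
# Serialize the selected table (first_table) and wrap the bytes in BytesIO;
# recent pandas versions deprecate passing raw HTML literals to read_html.
df = pd.read_html(BytesIO(etree.tostring(first_table)))[0]
print(df.to_markdown())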

Output :

|    | Tag        | Description                                                             |
|---:|:-----------|:------------------------------------------------------------------------|
|  0 | <table>    | Defines a table                                                         |
|  1 | <th>       | Defines a header cell in a table                                        |
|  2 | <tr>       | Defines a row in a table                                                |
|  3 | <td>       | Defines a cell in a table                                               |
|  4 | <caption>  | Defines a table caption                                                 |
|  5 | <colgroup> | Specifies a group of one or more columns in a table for formatting      |
|  6 | <col>      | Specifies column properties for each column within a <colgroup> element |
|  7 | <thead>    | Groups the header content in a table                                    |
|  8 | <tbody>    | Groups the body content in a table                                      |
|  9 | <tfoot>    | Groups the footer content in a table                                    |

But you can also get all tables in one shot, this way :

from io import BytesIO  # recent pandas versions expect a file-like object

list_tables = pd.read_html(BytesIO(r.content))
for table in list_tables:
    print(table.to_markdown() + '\n')

Output :

|    | Company                      | Contact          | Country   |
|---:|:-----------------------------|:-----------------|:----------|
|  0 | Alfreds Futterkiste          | Maria Anders     | Germany   |
|  1 | Centro comercial Moctezuma   | Francisco Chang  | Mexico    |
|  2 | Ernst Handel                 | Roland Mendel    | Austria   |
|  3 | Island Trading               | Helen Bennett    | UK        |
|  4 | Laughing Bacchus Winecellars | Yoshi Tannamuri  | Canada    |
|  5 | Magazzini Alimentari Riuniti | Giovanni Rovelli | Italy     |

|    | Tag        | Description                                                             |
|---:|:-----------|:------------------------------------------------------------------------|
|  0 | <table>    | Defines a table                                                         |
|  1 | <th>       | Defines a header cell in a table                                        |
|  2 | <tr>       | Defines a row in a table                                                |
|  3 | <td>       | Defines a cell in a table                                               |
|  4 | <caption>  | Defines a table caption                                                 |
|  5 | <colgroup> | Specifies a group of one or more columns in a table for formatting      |
|  6 | <col>      | Specifies column properties for each column within a <colgroup> element |
|  7 | <thead>    | Groups the header content in a table                                    |
|  8 | <tbody>    | Groups the body content in a table                                      |
|  9 | <tfoot>    | Groups the footer content in a table                                    |
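
When a page holds several tables, `pd.read_html` can also narrow the result itself through its `attrs` and `match` parameters. The sketch below is only an illustration under stated assumptions: the inline `SAMPLE` markup and the `id` values `customers` and `tags` are invented for the example and are not taken from the w3schools page, and it needs pandas plus one of its HTML parser backends (such as lxml) to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table document; the ids are made up for this sketch.
SAMPLE = """
<table id="customers">
  <thead><tr><th>Company</th><th>Country</th></tr></thead>
  <tbody><tr><td>Alfreds Futterkiste</td><td>Germany</td></tr></tbody>
</table>
<table id="tags">
  <thead><tr><th>Tag</th><th>Description</th></tr></thead>
  <tbody><tr><td>tr</td><td>Defines a row in a table</td></tr></tbody>
</table>
"""

# Select a table by an HTML attribute on the <table> element.
by_attr = pd.read_html(StringIO(SAMPLE), attrs={'id': 'tags'})

# Select tables containing text that matches a regular expression.
by_text = pd.read_html(StringIO(SAMPLE), match='Futterkiste')

print(by_attr[0])
print(by_text[0])
```

Both calls still return a list of DataFrames, so the single match is taken with `[0]`.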

Replace 'table/tbody/tr' with 'table/tr'.
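
Why the path matters can be checked on a small inline document. This is a minimal Python 3 sketch with made-up markup (`SAMPLE` is hypothetical, not the real page). It assumes, as the answers in this thread suggest, that lxml's HTML parser keeps the rows where the markup puts them and does not add the tbody element that a browser's inspector would show.

```python
from lxml import etree

# Hypothetical markup: the rows are direct children of <table>,
# with no <tbody> written in the source.
SAMPLE = (
    "<section id='rankings'><table>"
    "<tr><td>Yale</td><td>6-5-1</td></tr>"
    "</table></section>"
)

tree = etree.HTML(SAMPLE)
section = tree.xpath('//section[@id="rankings"]')[0]

with_tbody = section.xpath('table/tbody/tr')   # requires a tbody step
without_tbody = section.xpath('table/tr')      # rows directly under table
either = section.xpath('table//tr')            # rows at any depth under table

print(len(with_tbody), len(without_tbody), len(either))
```

The `table//tr` form is a common way to write a path that works whether or not the source markup contains tbody.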
