Python - Extract data from a specific table in a page

I just started learning Python. I spent the whole weekend on this project, but progress has been terrible. Hopefully I can get some guidance from the community.

Part of my tutorial requires me to extract data from the Google Finance page, https://www.google.com/finance, but only from the sector summary table, and then organize it into a JSON dump.

The questions I have so far are:

1) How do I extract data from the sector summary table only? I can use find_all, but the results include other tables as well.

2) How do I get the change for each sector, i.e. (energy : 0.99%, basic material : 0.31%, industrials : 0.17%)? There is no unique tag I can use. The only distinguishing characteristic is that these numbers appear alongside the sector name.

Looking at the page (either using View Source or your browser's developer tools), we know a few things:

  • The sector summary table is the only one inside a div tag with id=secperf (probably short for 'sector performance').
  • For every row except the first, the first cell from the left contains the sector name; the second one from the left contains the change percentage.
  • The other cells might contain bar graphs. The bar graphs also happen to be tables, but we want to ignore them, so we shouldn't recurse into them.
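
To make these observations concrete, here is a small sketch run against a made-up HTML fragment. The markup is an assumption that only mirrors the structure described above (a div with id=secperf, a header row, and nested bar-graph tables); it is not the real page source, and the sample values are the ones quoted in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above.
SAMPLE_HTML = """
<table id="other"><tr><td>unrelated</td></tr></table>
<div id="secperf">
<table>
<tr><th>Sector</th><th>Change</th><th>% down / up</th></tr>
<tr><td>Energy</td><td>0.99%</td><td><table><tr><td>bar</td></tr></table></td></tr>
<tr><td>Basic Materials</td><td>0.31%</td><td><table><tr><td>bar</td></tr></table></td></tr>
</table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Searching the whole document picks up every table, nested ones included.
all_tables = soup.find_all("table")

# Scoping the search to the div returns the first table inside it,
# which is the summary table.
summary = soup.find(id="secperf").find("table")

# recursive=False keeps only direct children, so the rows of the nested
# bar-graph tables are skipped.
direct_rows = summary.find_all("tr", recursive=False)
all_rows = summary.find_all("tr")

print(len(all_tables), len(direct_rows), len(all_rows))
```

On this fragment the script prints 4 3 5: four tables in total, three rows that belong to the summary table itself, and five rows once the nested tables are included. Note that some parsers (html5lib, for example) insert a tbody element, in which case the rows are no longer direct children of the table.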

There are many ways to approach this. One way would be as follows:

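def sector_summary(document):
    # document is a parsed BeautifulSoup object for the page.
    # The summary table is the first table inside the element with id='secperf'.
    table = document.find(id='secperf').find('table')
    # recursive=False keeps only the table's own rows,
    # not the rows of the nested bar-graph tables.
    rows = table.find_all('tr', recursive=False)

    # Skip the first (header) row.
    for row in rows[1:]:
        cells = row.find_all('td')

        # First cell: sector name; second cell: change percentage.
        sector = cells[0].get_text().strip()
        change = cells[1].get_text().strip()

        yield (sector, change)

# my_document is assumed to be the BeautifulSoup object built from the page's HTML.
print(dict(sector_summary(my_document)))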
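
Since the original task asks for a JSON dump, the resulting pairs can be handed to the standard json module. The sketch below is one possible way to do that, not the only one. The sample pairs stand in for the output of sector_summary and use the values quoted in the question, and the helper name to_json is made up for illustration.

```python
import json

# Stand-in for list(sector_summary(my_document)); the values are the
# ones quoted in the question, not freshly scraped data.
pairs = [
    ("Energy", "0.99%"),
    ("Basic Materials", "0.31%"),
    ("Industrials", "0.17%"),
]

def to_json(pairs):
    """Serialize (sector, change) pairs as a JSON object, keeping the change as text."""
    return json.dumps(dict(pairs), indent=2)

result = to_json(pairs)
print(result)
```

Keeping the change as a string preserves exactly what the page shows; converting it to a number would require stripping the percent sign first.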
