Get Table from a Web Page with Python

I have close to no knowledge of Python web scraping.

I need to get a table from this page:

http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF

The table I'm interested in is this: [image] (Disregard the chart above the table.)

This is what I have now:

from selenium import webdriver
from bs4 import BeautifulSoup

# load chrome driver
driver = webdriver.Chrome('C:/.../chromedriver_win32/chromedriver')

# load web page and get source html
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF'
driver.get(link)
html = driver.page_source

# make soup and get all tables
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table', {'class': 'r_table3'})
tbl = tables[1]  # ideally we should select table by name

How do I proceed from here?

To get the data from that web page, you can do something like this:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF'
driver.get(link)
time.sleep(3)

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

tab_data = soup.select('table')[1]
for items in tab_data.select('tr'):
    item = [elem.text for elem in items.select('th,td')]
    print(' '.join(item))

Partial result:

Total Return %  1-Day 1-Week 1-Month 3-Month YTD 1-Year 3-Year 5-Year 10-Year 15-Year
IWF (Price) 0.13 0.83 2.68 5.67 23.07 26.60 15.52 15.39 8.97 10.14
IWF (NAV) 0.20 0.86 2.66 5.70 23.17 26.63 15.52 15.40 8.98 10.14
S&P 500 TR USD (Price) 0.18 0.52 2.42 4.52 16.07 22.40 13.51 14.34 7.52 9.76
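
The same row-by-row extraction can be tried offline on a small HTML snippet, which makes it easier to check the parsing logic without launching a browser. The sketch below is only an illustration: the `r_table3` class comes from the question, while `SAMPLE_HTML` and the helper `table_to_rows` are made up here, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source; the real markup may differ.
SAMPLE_HTML = """
<table class="r_table3">
  <tr><th>Total Return %</th><th>1-Day</th><th>1-Week</th></tr>
  <tr><th>IWF (Price)</th><td>0.13</td><td>0.83</td></tr>
  <tr><th>IWF (NAV)</th><td>0.20</td><td>0.86</td></tr>
</table>
"""


def table_to_rows(html, css_class="r_table3"):
    """Return the first table with the given class as a list of rows of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=css_class)
    if table is None:
        raise ValueError(f"no table with class {css_class!r} found")
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


rows = table_to_rows(SAMPLE_HTML)
for row in rows:
    print(" | ".join(row))
```

Selecting the table by its class rather than by position (`soup.select('table')[1]`) should keep working if the page gains or loses other tables, provided the class name itself stays the same.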

OK, so here's how I did it:

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd  # needed for pd.DataFrame below

# load chrome driver
driver = webdriver.Chrome('C:/.../chromedriver_win32/chromedriver')

# load web page and get source html
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF'
driver.get(link)
html = driver.page_source

# make soup and get table
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table',{'class':'r_table3'})
tbl = tables[1]  # ideally we should select table by name

# column and row names
rows = tbl.find_all('tr')
column_names = [x.get_text() for x in rows[0].find_all('th')[1:]]
row_names = [x.find_all('th')[0].get_text() for x in rows[1:]]

# table content
df = pd.DataFrame(columns=column_names, index=row_names)
for row in rows[1:]:
    row_name = row.find_all('th')[0].get_text()
    df.loc[row_name] = [column.get_text() for column in row.find_all('td')]  # .ix was removed from pandas; .loc is the label-based indexer
print(df)

Is there a more elegant way, i.e. an off-the-shelf method that I can call instead of looping through rows and columns?
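
One off-the-shelf candidate is `pandas.read_html`, which parses every `<table>` in an HTML document into a DataFrame. The sketch below runs it on a made-up snippet (`SAMPLE_HTML` is hypothetical, and only the `r_table3` class is taken from the code above). It assumes an HTML parser that pandas can use, such as `lxml`, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for driver.page_source; the real markup may differ.
SAMPLE_HTML = """
<table class="r_table3">
  <tr><th>Total Return %</th><th>1-Day</th><th>1-Week</th></tr>
  <tr><th>IWF (Price)</th><td>0.13</td><td>0.83</td></tr>
  <tr><th>IWF (NAV)</th><td>0.20</td><td>0.86</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per matching <table>.
# attrs filters tables by HTML attribute; index_col=0 uses the first
# column (the row labels) as the index.
tables = pd.read_html(StringIO(SAMPLE_HTML), attrs={"class": "r_table3"}, index_col=0)
df = tables[0]
print(df)
```

With Selenium, the same call could presumably be made on `StringIO(driver.page_source)` once the page has finished rendering. Whether the live page's table parses as cleanly as this sample is not something the snippet can confirm.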
