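Python: parse a table using BeautifulSoup
I am trying to extract a table from the following website: personal.vanguard.com
I am trying to get the "Holdings" and "Market value" columns.
I have tried the following code, but with no luck: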
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://personal.vanguard.com/us/FundsAllHoldings?FundId=0970&FundIntExt=INT&tableName=Equity&tableIndex=0'
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
print(soup.prettify())
print(soup('tbody'))
table = soup.find("tbody", {"class": "Holding"})
print(table)
for row in table.findAll("tr"):
    cells = row.findAll("td")
You can select all the rows with this expression:
soup.select('tbody tr')
Then, for each row, you can extract all of its columns:
[tr('td') for tr in soup.select('tbody tr')]
# Example output (note the first empty row):
[[],
 [<td align="left">zulily Inc. Class A</td>,
  <td>965,202</td>,
  <td class="nr">$12,750,318</td>],
 [<td align="left">xG Technology Inc.</td>,
  <td>34,385</td>,
  <td class="nr">$57,767</td>],
 [<td align="left">vTv Therapeutics Inc. Class A</td>,
  <td>80,223</td>,
  <td class="nr">$802,230</td>],
 [<td align="left">salesforce.com inc</td>,
  <td>11,014,606</td>,
  <td class="nr">$807,370,620</td>],
 [<td align="left">pSivida Corp.</td>,
  <td>447,326</td>,
  <td class="nr">$1,816,144</td>],
 [<td align="left">lululemon athletica Inc.</td>,
  <td>1,737,050</td>,
  <td class="nr">$109,190,963</td>]]
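To try the row and column selection without hitting the live page, here is a minimal sketch that runs against a small inline snippet. It is written for Python 3 with the built-in `html.parser` backend. `SAMPLE_HTML` and `extract_rows` are hypothetical names made up for illustration, and the markup (including the `<th>` header row, assumed here to be the cause of the first empty list) is only modelled on the example output above; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the example output; the real page may differ.
SAMPLE_HTML = """
<table>
  <tbody class="right">
    <tr><th>Holdings</th><th>Shares</th><th>Market value</th></tr>
    <tr><td align="left">zulily Inc. Class A</td><td>965,202</td><td class="nr">$12,750,318</td></tr>
    <tr><td align="left">xG Technology Inc.</td><td>34,385</td><td class="nr">$57,767</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return one list of cell strings for every row that contains <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("tbody tr"):
        # Calling a tag, as in tr("td"), is shorthand for tr.find_all("td").
        cells = [td.get_text(strip=True) for td in tr("td")]
        if cells:  # a header row made only of <th> cells gives an empty list
            rows.append(cells)
    return rows


if __name__ == "__main__":
    for name, shares, market_value in extract_rows(SAMPLE_HTML):
        print(name, market_value)
```

Using `get_text(strip=True)` returns the cell contents as plain strings instead of `<td>` tags, which is usually what is wanted when only the values matter.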
You just need to filter for the columns you need.
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://personal.vanguard.com/us/FundsAllHoldings?FundId=0970&FundIntExt=INT&tableName=Equity&tableIndex=0'
soup = BeautifulSoup(urlopen(url), 'html.parser')
table = soup.find("tbody", {"class": "right"})
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) > 0:  # skip first row
        holding = cells[0]  # first column: holding name
        mv = cells[2]       # third column: market value
        print(holding, mv)
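The loop above prints the raw `<td>` tags. If plain values are preferred, a sketch along the following lines could turn each row into a `(name, market_value)` pair. It assumes Python 3, that the market value cell is formatted like `$12,750,318`, and that the table body carries `class="right"` as in the code above. `parse_holdings` and `SAMPLE_HTML` are hypothetical names used only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the example output in this thread.
SAMPLE_HTML = """
<table>
  <tbody class="right">
    <tr><th>Holdings</th><th>Shares</th><th>Market value</th></tr>
    <tr><td align="left">salesforce.com inc</td><td>11,014,606</td><td class="nr">$807,370,620</td></tr>
    <tr><td align="left">pSivida Corp.</td><td>447,326</td><td class="nr">$1,816,144</td></tr>
  </tbody>
</table>
"""


def parse_holdings(html):
    """Return (holding name, market value in whole dollars) for each data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("tbody", {"class": "right"})
    if table is None:  # find() returns None when nothing matches
        return []
    holdings = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 3:  # skip the header row and any short rows
            continue
        name = cells[0].get_text(strip=True)
        raw_value = cells[2].get_text(strip=True)
        # Assumed format "$12,750,318": drop the dollar sign and separators.
        market_value = int(raw_value.replace("$", "").replace(",", ""))
        holdings.append((name, market_value))
    return holdings


if __name__ == "__main__":
    for name, market_value in parse_holdings(SAMPLE_HTML):
        print(name, market_value)
```

Checking for `None` before iterating matters here: when the class name passed to `find` does not match anything, as with `"Holding"` in the question, `find` returns `None` and calling `findAll` on it raises an `AttributeError`.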