
beautiful soup - python - table scraping

Trying to scrape a table from a website using Beautiful Soup in order to parse the data. How would I go about parsing it by its headers? So far, I can't even manage to print the entire table. Thanks in advance.

Here is the code:

import urllib2
from bs4 import BeautifulSoup

optionstable = "http://www.barchart.com/options/optdailyvol?type=stocks"
page = urllib2.urlopen(optionstable)
soup = BeautifulSoup(page, 'lxml')

# the table wrapper exists in the static HTML, but the rows are injected by JavaScript
table = soup.find("div", {"class": "dataTables_wrapper", "id": "options_wrapper"})

table1 = table.find_all('table')

print table1  # only prints the empty table skeleton

You need to mimic the ajax request to get the table data:

import requests
from time import time

optionstable = "http://www.barchart.com/options/optdailyvol?type=stocks"


params = {"type": "stocks",
          "dir": "desc",
          "_": str(time()),
          "f": "base_symbol,type,strike,expiration_date,bid,ask,last,volume,open_interest,volatility,timestamp",
          "sEcho": "1",
          "iDisplayStart": "0",
          "iDisplayLength": "100",
          "iSortCol_0": "7",
          "sSortDir_0": "desc",
          "iSortingCols": "1",
          "bSortable_0": "true",
          "bSortable_1": "true",
          "bSortable_2": "true",
          "bSortable_3": "true",
          "bSortable_4": "true",
          "bSortable_5": "true",
          "bSortable_6": "true",
          "bSortable_7": "true",
          "bSortable_8": "true",
          "bSortable_9": "true",
          "bSortable_10": "true",
          "sortby": "Volume"}

Then do a get passing the params:

js = requests.get("http://www.barchart.com/option-center/getData.php", params=params).json()

Which gives you:

{u'aaData': [[u'<a href="/quotes/BAC">BAC</a>', u'Call', u'16.00', u'12/16/16', u'0.89', u'0.90', u'0.91', u'52,482', u'146,378', u'0.26', u'01:43'], [u'<a href="/quotes/ETE">ETE</a>', u'Call', u'20.00', u'01/20/17', u'0.38', u'0.41', u'0.40', u'40,785', u'72,011', u'0.42', u'01:27'], [u'<a href="/quotes/BAC">BAC</a>', u'Call', u'15.00', u'10/21/16', u'1.34', u'1.36', u'1.33', u'35,663', u'90,342', u'0.35', u'01:44'], [u'<a href="/quotes/COTY">COTY</a>', u'Put', u'38.00', u'10/21/16', u'15.00', u'15.30', u'15.10', u'32,321', u'242,382', u'1.24', u'01:44'], [u'<a href="/quotes/COTY">COTY</a>', u'Call', u'38.00', u'10/21/16', u'0.00', u'0.05', u'0.01', u'32,320', u'256,589', u'1.34', u'01:44'], [u'<a href="/quotes/WFC">WFC</a>', u'Put', u'40.00', u'10/21/16', u'0.01', u'0.03', u'0.02', u'32,121', u'37,758', u'0.39', u'01:43'], [u'<a href="/quotes/WFC">WFC</a>', u'Put', u'40.00', u'11/18/16', u'0.16', u'0.17', u'0.16', u'32,023', u'8,789', u'0.30', u'01:44'], ...

There are many more params you can pass. If you watch the request in Chrome dev tools under the XHR tab, you can see all of them; the ones above were the minimum needed to get a result. There are quite a lot, so I won't post them all here, and I'll leave it to you to figure out how you can influence the results.
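As a sketch of how the paging parameters interact, iDisplayStart and iDisplayLength are the standard DataTables server-side paging fields (row offset and page size). The helper below, which builds the query dict for an arbitrary page, is my own addition, not part of the site's API:

```python
from time import time

# Hypothetical helper: build the minimal query dict for one page of results.
# iDisplayStart / iDisplayLength are the standard DataTables server-side
# paging parameters; page numbering here starts at 0.
def build_page_params(page=0, page_size=100, sortby="Volume"):
    return {
        "type": "stocks",
        "dir": "desc",
        "_": str(time()),  # cache-busting timestamp, as in the request above
        "f": ("base_symbol,type,strike,expiration_date,bid,ask,last,"
              "volume,open_interest,volatility,timestamp"),
        "sEcho": "1",
        "iDisplayStart": str(page * page_size),  # row offset
        "iDisplayLength": str(page_size),        # rows per page
        "iSortCol_0": "7",                       # sort on the volume column
        "sSortDir_0": "desc",
        "iSortingCols": "1",
        "sortby": sortby,
    }

# e.g. requests.get(url, params=build_page_params(page=2))
# would ask the endpoint for rows 200-299
```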

If you iterate over js[u'aaData'], you can see each sublist, where each entry corresponds to the columns like:

#base_symbol,type,strike,expiration_date,bid,ask,last,volume,open_interest,volatility,timestamp

[u'<a href="/quotes/AAPL">AAPL</a>', u'Call', u'116.00', u'10/14/16', u'1.36', u'1.38', u'1.37', u'21,812', u'7,258', u'0.23', u'10/10/16']

So if you want to filter rows based on some criteria, i.e. strike > 15:

for d in filter(lambda row: float(row[2]) > 15, js[u'aaData']):
    print(d)
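One caveat with numeric filters: the volume and open-interest strings contain thousands separators (e.g. u'52,482'), so a bare float() call on them raises a ValueError. A small parsing helper, my own addition rather than part of the original answer, handles that; the sample rows below are taken from the JSON output shown earlier:

```python
# Numeric strings in aaData use thousands separators ("52,482"), so strip
# the commas before converting. This helper is a hypothetical addition.
def to_number(s):
    return float(s.replace(",", ""))

# Example: keep only rows with volume above 40,000 (volume is column 7)
rows = [
    [u'BAC', u'Call', u'16.00', u'12/16/16', u'0.89', u'0.90', u'0.91',
     u'52,482', u'146,378', u'0.26', u'01:43'],
    [u'WFC', u'Put', u'40.00', u'11/18/16', u'0.16', u'0.17', u'0.16',
     u'32,023', u'8,789', u'0.30', u'01:44'],
]
high_volume = [r for r in rows if to_number(r[7]) > 40000]
```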

You might also find pandas useful, with a little bit of tidying up we can create a nice df:

# extract the base_symbol text from the anchor tag
for v in js[u'aaData']:
    v[0] = BeautifulSoup(v[0], 'lxml').a.text


import pandas as pd

cols = "base_symbol,type,strike,expiration_date,bid,ask,last,volume,open_interest,volatility,timestamp"

df = pd.DataFrame(js[u'aaData'], columns=cols.split(","))

print(df.head(5))

That gives you a nice df to work with:

  base_symbol  type strike expiration_date    bid    ask   last  volume  \
0         BAC  Call  16.00        12/16/16   0.89   0.90   0.91  52,482   
1         ETE  Call  20.00        01/20/17   0.38   0.41   0.40  40,785   
2         BAC  Call  15.00        10/21/16   1.34   1.36   1.33  35,663   
3        COTY   Put  38.00        10/21/16  15.00  15.30  15.10  32,321   
4        COTY  Call  38.00        10/21/16   0.00   0.05   0.01  32,320   

  open_interest volatility timestamp  
0       146,378       0.26  10/10/16  
1        72,011       0.42  10/10/16  
2        90,342       0.35  10/10/16  
3       242,382       1.24  10/10/16  
4       256,589       1.34  10/10/16  

You might just want to change the dtypes, e.g. df["strike"] = df["strike"].astype(float), etc.
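For the comma-separated columns, the same idea applies: strip the separators before converting. A sketch of the cleanup, using a couple of sample values from the table above rather than the live data:

```python
import pandas as pd

# Sample rows shaped like the df above (values from the printed output)
df = pd.DataFrame(
    {"strike": ["16.00", "20.00"],
     "volume": ["52,482", "40,785"]}
)

# strike is a plain decimal string, so astype(float) works directly;
# volume needs its thousands separators removed first
df["strike"] = df["strike"].astype(float)
df["volume"] = df["volume"].str.replace(",", "", regex=False).astype(int)
```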

The page is dynamic. It uses the DataTables jQuery plugin to display the table. If you look at the page source, you'll see an empty table; it is filled after the page is loaded.

You have two choices here. The first is to figure out which URL the JavaScript code running in the browser accesses to get the table data, and hope that simply requesting that URL works. If it does, the response will probably be JSON, so you won't need to parse any HTML.

The second choice is to use a tool like Selenium or PhantomJS to load the page and run the Javascript code. Then you can access the already populated table and get its data.
