
How to call specific table on webpage with multiple tables

I am trying to pull information from a specific table on this website: https://www.wsj.com/market-data . This is my code so far (I am new to Python, in case it's not obvious). I want to pull just the information in the Bonds table. Can I index the tables so that I can call a specific table by an index number?

from selenium import webdriver
from bs4 import BeautifulSoup

chrome_path = r"C:\Users\ddai\AppData\Local\Programs\Python\Python37\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
url2 = "https://wsj.com/market-data"
driver.get(url2)
html = driver.execute_script("return document.documentElement.outerHTML")
sel_soup = BeautifulSoup(html, 'html.parser')
print(len(sel_soup.findAll("table")))

You could. There are a couple of approaches to this. I presume the code you posted grabs the full HTML of the page; if so, you can work with that. If you are using Chrome, press F12 on the page to see the full HTML (or just print it in Python and take a look).

Option 1: Grab Table by the Name 'Bonds'

Look at how the DOM is organized: the name "Bonds" is a cousin of the table, which makes it a bit tricky to relate the two. A grossly simplified layout:

<div>
    <div>
        <h3><span>Bonds</span></h3>
    </div>
    <article></article>
    <div class="WSJTables--tableWrapper--2oPULowO WSJBase--card--3gQ6obvQ ">
        <table class="WSJTables--table--1SdkiG8p WSJTheme--table--2a-shks_ ">
            <thead class="WSJTables--table__head--hprNkLrs WSJTheme--table__head--3n6NRMJE ">

The table and the name "Bonds" do share a grandparent, though. You can use a regexp to find the "Bonds" heading -- something like <h(?<hNum>[1-6]).*>.*Bonds.*</h\k<hNum>> -- and then find the associated table by either grabbing the very next occurrence of <table.*> , or using a DOM-traversal tool that lets you travel up to the grandparent and then narrow down to the table. From there you have the table in its full, glorious HTML form and can parse it. This approach is probably the most reliable, since you can swap in a different name to find any of the other tables.
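A minimal sketch of that grandparent-navigation idea with BeautifulSoup, run against a stripped-down copy of the layout above (class names omitted for brevity; the real page will need the Selenium-fetched HTML instead):

```python
from bs4 import BeautifulSoup
import re

# Stand-in for the real page's markup, mirroring the simplified layout above.
html = """
<div>
    <div><h3><span>Bonds</span></h3></div>
    <article></article>
    <div>
        <table>
            <tr><th>Country</th><th>Yield(%)</th></tr>
            <tr><td>U.S. 10 Year</td><td>2.040</td></tr>
        </table>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Find any <h1>..<h6> whose text contains "Bonds" ...
heading = soup.find(lambda tag: re.fullmatch(r"h[1-6]", tag.name)
                    and "Bonds" in tag.get_text())
# ... then walk up to the shared grandparent and down to its table.
table = heading.find_parent().find_parent().find("table")
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")]
print(rows)  # [['Country', 'Yield(%)'], ['U.S. 10 Year', '2.040']]
```
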

Alternatively, you could search for the table by its class name, which seems to be unique -- but that is not guaranteed, so this verges on hard-coding.
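As a quick illustration of the class-name route, a CSS substring selector can match on the stable WSJTables--table prefix while ignoring the hashed suffix, which looks auto-generated and may change between builds:

```python
from bs4 import BeautifulSoup

# Sketch of matching the table by a partial class name. The 'WSJTables--table'
# prefix is taken from the snippet above; the hashed suffix may change, so the
# [class*=...] substring selector matches on the prefix only.
html = ('<div><table class="WSJTables--table--1SdkiG8p">'
        '<tr><td>U.S. 10 Year</td></tr></table></div>')

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one('table[class*="WSJTables--table"]')
print(table.td.get_text())  # U.S. 10 Year
```
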


Option 2: Indexing

In an indexing approach, you can find all <table.*>.*?</table> matches, looping through them and collecting them into a list. However, you would likely need to hardcode which table index you want: if the Bonds table is the fifth table found, you would reference it as tables[4] (zero-based), which is less than ideal. If this is for a simple, temporary program, or the page will remain relatively static for a long time, this approach could work fine and it is relatively simple. However, if the layout of the page changes at all, or your program needs to be dynamic, it may be worth the effort to build a more reliable approach.
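A minimal sketch of the indexing approach (the two-table HTML here is a stand-in for the real page):

```python
from bs4 import BeautifulSoup

# Collect every table on the page into a list and pick one by position.
# The index is a hard-coded assumption that silently breaks if the page
# layout changes.
html = ('<table><tr><td>Major Indexes</td></tr></table>'
        '<table><tr><td>Bonds data</td></tr></table>')

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")
bonds_table = tables[1]           # zero-based: the second table on the page
print(bonds_table.td.get_text())  # Bonds data
```

pandas.read_html gives the same positional access in one step -- pd.read_html(html)[1] returns the second table as a DataFrame -- with the same fragility.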

Once you select the header of the container in which the desired table sits, the rest is easy. You could also get the table from a script tag using requests alone and then post-process the content with json() . For now, try the script below to get the tabular content using selenium.

from selenium import webdriver
from bs4 import BeautifulSoup

with webdriver.Chrome() as driver:
    driver.get("https://wsj.com/market-data")
    sel_soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ':-soup-contains' is the non-deprecated spelling of ':contains'
    # in recent versions of BeautifulSoup/soupsieve.
    header = sel_soup.select_one("[class*='card__header']:-soup-contains(Bonds)")
    for row in header.find_parent().select("table tr"):
        data = [cell.get_text(strip=True) for cell in row.select("th,td")]
        print(data)

Output:

['Country', 'Yield(%)', 'Yield Chg']
['U.S. 10 Year', '2.040', '0.090']
['Germany 10 Year', '-0.362', '0.034']
['U.K. 10 Year', '0.739', '0.059']
['Japan 10 Year', '-0.163', '-0.008']
['Australia 10 Year', '1.293', '-0.010']
['China 10 Year', '3.176', '-0.012']

There is another endpoint you can use, which you can find in the network tab when clicking through to the bonds page. The JSON returned can be parsed to retrieve the table. This brings back slightly more info than is shown at the url you gave, but it includes what you require.

import requests
import pandas as pd
import re

def get_bond_info(r):
    data = r['data']['instruments']
    results = []
    for item in data:
        results.append(
            {'Name': item['djLegalName'],
             'CPN (%)': item['couponPercent'],
             'Spread': item['spread'],  # latest spread over Treasury
             'YLD (%)': item['yieldPercent'],
             'YLD CHG': item['yieldChange']
            }
        )
    df = pd.DataFrame(results, columns=['Name', 'CPN (%)', 'Spread', 'YLD (%)', 'YLD CHG'])
    return df

identifier = '''{"application":"WSJ",
                "bonds":[
                    {"symbol":"TMUBMUSD10Y","name":"U.S."},
                    {"symbol":"TMBMKDE-10Y","name":"Germany"},
                    {"symbol":"TMBMKGB-10Y","name":"U.K."},
                    {"symbol":"TMBMKJP-10Y","name":"Japan"},
                    {"symbol":"TMBMKAU-10Y","name":"Australia"},
                    {"symbol":"AMBMKRM-10Y","name":"China"},
                    {"symbol":"TMBMKNZ-10Y","name":"New Zealand"},
                    {"symbol":"TMBMKFR-10Y","name":"France"},
                    {"symbol":"TMBMKIT-10Y","name":"Italy"},
                    {"symbol":"TMBMKES-10Y","name":"Spain"}
                ]
             }'''


url = 'https://www.wsj.com/market-data/bonds?id=' + re.sub(r'\n\s+', '', identifier) + '&type=mdc_governmentbonds'
r = requests.get(url).json()
print(get_bond_info(r))

If you familiarise yourself with how the id parameter is constructed, you can retrieve other info, such as different maturities:

import requests
import pandas as pd

url = 'https://www.wsj.com/market-data/bonds?id={"application":"WSJ","instruments":[{"symbol":"BOND/BX//TMUBMUSD30Y","name":"30-Year Bond"},{"symbol":"BOND/BX//TMUBMUSD10Y","name":"10-Year Note"},{"symbol":"BOND/BX//TMUBMUSD07Y","name":"7-Year Note"},{"symbol":"BOND/BX//TMUBMUSD05Y","name":"5-Year Note"},{"symbol":"BOND/BX//TMUBMUSD03Y","name":"3-Year Note"},{"symbol":"BOND/BX//TMUBMUSD02Y","name":"2-Year Note"},{"symbol":"BOND/BX//TMUBMUSD01Y","name":"1-Year Bill"},{"symbol":"BOND/BX//TMUBMUSD06M","name":"6-Month Bill"},{"symbol":"BOND/BX//TMUBMUSD03M","name":"3-Month Bill"},{"symbol":"BOND/BX//TMUBMUSD01M","name":"1-Month Bill"}]}&type=mdc_quotes'
r = requests.get(url).json()
data = r['data']['instruments']
results = []
for item in data:
    results.append(
    {'Name': item['formattedName'],
     'CPN (%)': item['bond']['couponRate'],
     'PRC CHG': item['bond']['formattedTradePriceChange'],
     'YLD (%)': item['bond']['yield'],
     'YLD CHG': item['bond']['yieldChange']
    }
    )

df = pd.DataFrame(results, columns = ['Name', 'CPN (%)', 'PRC CHG','YLD (%)','YLD CHG' ])
print(df)
