
Python Beautiful Soup - Span class text not extracted

I'm using Beautiful Soup for the first time, and the text inside a span is not being extracted. I'm not familiar with HTML, so I'm unsure why this happens and would like to understand it.

I've used the code below:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

content = page_soup.findAll("span",attrs={"data-item":"rate"})

With this code, content[0] returns the following:

<span class="productdata" data-baserate-code="VRI" data-cc="AU" data-item="rate" data-section="PHL" data-subsection="VR"></span>

However, I expected something like what I see when I inspect the page in Chrome, where the span contains text such as the interest rate:

<span class="productdata" data-cc="AU" data-section="PHL" data-subsection="VR" data-baserate-code="VRI" data-item="rate">5.20% p.a.</span>

The data you are trying to extract does not exist in the HTML that the server returns. It is loaded by JavaScript after the page loads: the website calls a JSON API to fill in the values, so Beautiful Soup cannot find them. You can view the data at the following link, which queries the site's JSON API directly:

https://www.anz.com/productdata/productdata.asp?output=json&country=AU&section=PHL

You can parse the JSON and get the data. Also, for HTTP requests I would recommend the requests package.
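A minimal sketch of that approach might look like the following. It uses only the standard library (urllib and json) so it runs without extra packages; requests would work just as well. The structure of the API response is not shown in this thread, so the `sample` payload and the `"rate"` key below are purely hypothetical, and `find_values` is an illustrative helper that walks whatever nesting the real response turns out to have.

```python
import json
from urllib.request import urlopen

# URL taken from the answer above.
API_URL = "https://www.anz.com/productdata/productdata.asp?output=json&country=AU&section=PHL"


def fetch_json(url, timeout=10):
    """Download a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read())


def find_values(node, key):
    """Recursively collect every value stored under `key` in nested JSON."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_values(item, key))
    return found


# Hypothetical payload: the real response layout and key names are assumptions.
sample = {
    "products": [
        {"code": "VRI", "rate": "5.20"},
        {"code": "X", "details": {"rate": "5.30"}},
    ]
}
print(find_values(sample, "rate"))  # ['5.20', '5.30']
```

Against the live endpoint you would call `data = fetch_json(API_URL)`, inspect `data` to see which keys it actually contains, and then pass the relevant key name to `find_values`.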

As others said, the content is generated by JavaScript. You can use Selenium together with ChromeDriver to find the data you want, with something like:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome")

# Selenium 4 removed find_elements_by_css_selector; use find_elements with By.CSS_SELECTOR
items = driver.find_elements(By.CSS_SELECTOR, "span[data-item='rate']")
itemsText = [item.get_attribute("textContent") for item in items]

>>> itemsText
['5.20% p.a.', '5.30% p.a.', '5.75% p.a.', '5.52% p.a.', ....]

As seen above, BeautifulSoup wasn't necessary at all, but you can use it instead to parse the page source and get the same results:

from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
items = soup.findAll("span",{"data-item":"rate"})
itemsText = [item.text for item in items]
