
How to scrape JavaScript table from website to dataframe?

I am trying to scrape a JavaScript-rendered table from a website into a dataframe. The soup contains only the reference to the script, not the table itself. The MWE and soup output are given below. Is it possible to scrape the table from https://iborrowdesk.com into a dataframe, and if so, how?

MWE

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Adjacent string literals are joined, so the header value contains no stray whitespace
headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/72.0.3626.28 Safari/537.36'}
session = requests.Session()
website = session.get('https://iborrowdesk.com', headers=headers, timeout=10)
website.raise_for_status()
soup = BeautifulSoup(website.text, 'lxml')
table = soup.find('table', class_='table table-condensed table-hover')
data = pd.read_html(str(table))[0]

Soup output

<html><head><link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="/site.webmanifest" rel="manifest"/>
<link color="#5bbad5" href="/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#da532c" name="msapplication-TileColor"/>
<meta content="#ffffff" name="theme-color"/>
<link href="https://maxcdn.bootstrapcdn.com/bootswatch/3.3.6/flatly/bootstrap.min.css" rel="stylesheet"/>
<meta charset="utf-8"/><meta content="width=device-width,initial-scale=1" name="viewport"/>
<title>IBorrowDesk</title><script src="//cdn.thisiswaldo.com/static/js/9754.js"></script>
</head><body><div class="container"></div><script src="/static/main.bundle.js?39ed89dd02e44899ebb4">
</script></body></html>

You can use requests directly, since the site exposes an API.

import pandas as pd
import requests


def get_data() -> pd.DataFrame:
    # API endpoint exposed by the site
    url = "https://iborrowdesk.com/api/most_expensive"

    with requests.Session() as session:
        response = session.get(url, timeout=10)
    # Raise an HTTPError on any 4xx/5xx status instead of parsing an error page
    response.raise_for_status()

    data = response.json()

    # Flatten the list of records stored under "results" into a DataFrame
    return pd.json_normalize(data["results"])


df = get_data()
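
To see what pd.json_normalize does with the "results" list, here is a small offline sketch. The payload below is made up for illustration: the field names (symbol, latest, fee, available) and the values are hypothetical and are not the real schema of the iborrowdesk API. Only the top-level "results" key comes from the code above.

```python
import pandas as pd

# Hypothetical payload: field names and values are invented for illustration,
# only the top-level "results" key matches the answer's code.
sample = {
    "results": [
        {"symbol": "AAA", "latest": {"fee": 12.5, "available": 1000}},
        {"symbol": "BBB", "latest": {"fee": 3.0, "available": 250}},
    ]
}

# Nested dicts are flattened into dotted column names such as "latest.fee"
df = pd.json_normalize(sample["results"])
print(df)
```

If the real records are already flat, pd.DataFrame(data["results"]) would likely give the same result; json_normalize only matters when the records contain nested objects.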

As Jason Baker mentioned in his answer, you can use the API that the site provides. Alternatively, you can scrape the data with Selenium. The question "Python webscraping: BeautifulSoup not showing all html source content" is relevant here: it explains why requests.Session().get(url) cannot retrieve every element in the DOM. The table elements are created by JavaScript after the page loads, so the HTML source that requests downloads does not contain them. That question's answers also include a code snippet, which I've adapted to your case:

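from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Firefox()
try:
    browser.get('https://iborrowdesk.com/')
    # The table is inserted by JavaScript, so wait up to 10 seconds for it to appear
    table = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'table'))
    ).get_attribute("outerHTML")
finally:
    browser.quit()  # close the browser even if the lookup fails
# Newer pandas versions deprecate passing raw HTML strings, so wrap it in StringIO
data = pd.read_html(StringIO(table))[0]
print(data)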
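
If you want to confirm that the table really is missing from the downloaded source (rather than just being hard to select), a quick check is to count the tags in it. The sketch below uses only the standard library and a trimmed copy of the soup output from the question, so it runs offline; the TagCounter class is a helper name I made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Trimmed copy of the page source shown in the question
PAGE_SOURCE = (
    '<html><head><title>IBorrowDesk</title></head>'
    '<body><div class="container"></div>'
    '<script src="/static/main.bundle.js?39ed89dd02e44899ebb4"></script>'
    '</body></html>'
)


class TagCounter(HTMLParser):
    """Hypothetical helper: counts how often each opening tag occurs."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


counter = TagCounter()
counter.feed(PAGE_SOURCE)

print(counter.counts["table"])   # 0: no <table> in the static HTML
print(counter.counts["script"])  # 1: the bundle that builds the page
```

A table count of zero next to a script tag and an empty container div is a strong hint that the content is rendered client-side, which is when the API or Selenium route is needed.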

