
How to scrape JavaScript table from website to dataframe?

I am trying to scrape a JavaScript-rendered table from a website into a dataframe. The soup contains only the script location, not the table itself. The MWE and the soup output are given below. Is it possible to scrape the table from this website into a dataframe, and if so, how?

MWE

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) \
                Chrome/72.0.3626.28 Safari/537.36'}
session = requests.Session()
website = session.get('https://iborrowdesk.com', headers=headers, timeout=10)
website.raise_for_status()
soup = BeautifulSoup(website.text, 'lxml')
table = soup.find('table', class_='table table-condensed table-hover')
data = pd.read_html(str(table))[0]

Soup output

<html><head><link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="/site.webmanifest" rel="manifest"/>
<link color="#5bbad5" href="/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#da532c" name="msapplication-TileColor"/>
<meta content="#ffffff" name="theme-color"/>
<link href="https://maxcdn.bootstrapcdn.com/bootswatch/3.3.6/flatly/bootstrap.min.css" rel="stylesheet"/>
<meta charset="utf-8"/><meta content="width=device-width,initial-scale=1" name="viewport"/>
<title>IBorrowDesk</title><script src="//cdn.thisiswaldo.com/static/js/9754.js"></script>
</head><body><div class="container"></div><script src="/static/main.bundle.js?39ed89dd02e44899ebb4">
</script></body></html>

You can use requests, since the site exposes an API.

import pandas as pd
import requests


def get_data() -> pd.DataFrame:
    url = "https://iborrowdesk.com/api/most_expensive"

    with requests.Session() as session:
        response = session.get(url, timeout=10)
    # Raise an HTTPError on a 4xx/5xx response instead of printing None.
    response.raise_for_status()

    # Decode the JSON body directly; no separate json.loads() call is needed.
    data = response.json()

    return pd.json_normalize(data=data["results"])


df = get_data()
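
To see what pd.json_normalize does with the "results" list without calling the API, here is a minimal offline sketch. Only the top-level "results" key is taken from the code above; the field names and values in the sample payload are hypothetical, so the real response will have different columns.

```python
import pandas as pd

# Hypothetical payload: only the top-level "results" key comes from the answer;
# the field names and values below are invented for illustration.
sample = {
    "results": [
        {"symbol": "AAA", "fee": 12.5, "available": 1000},
        {"symbol": "BBB", "fee": 8.0, "available": 250},
    ]
}

# Each dict in the list becomes one row; each key becomes one column.
df_sample = pd.json_normalize(data=sample["results"])
print(df_sample)
```

If the real records contain nested dictionaries, json_normalize flattens them into dotted column names, which is the main reason to prefer it over pd.DataFrame here.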

As Jason Baker mentioned in his post, you can use the API that's provided. Alternatively, you can use Selenium to scrape the data. This question (Python webscraping: BeautifulSoup not showing all html source content) is relevant to yours. It explains why requests.Session().get(url) is unable to retrieve all of the elements in the DOM: the elements are created by JavaScript, so the page source HTML doesn't initially contain them, and they are only inserted once the script runs in a browser. The question I linked also contains a code snippet in the answers, which I've updated to match your question:

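from io import StringIO
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Firefox()
browser.get('https://iborrowdesk.com/')
# The table is inserted by JavaScript, so wait (up to 10 s) for it to appear.
element = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'table'))
)
table = element.get_attribute("outerHTML")
browser.quit()
# Newer pandas versions deprecate passing literal HTML strings; wrap in StringIO.
data = pd.read_html(StringIO(table))[0]
print(data)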
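
To confirm that the table really is missing from the static HTML (and not just missed by the selector), you can count the tags in what the server returns. The sketch below is an illustration only: it uses the standard-library html.parser instead of BeautifulSoup so that it runs without extra packages, it works on a trimmed copy of the soup output from the question, and the TagCounter class and inspect_static_html function are names I made up for it.

```python
from html.parser import HTMLParser

# Trimmed copy of the static HTML shown in the question's soup output.
STATIC_HTML = """<html><head><title>IBorrowDesk</title></head>
<body><div class="container"></div>
<script src="/static/main.bundle.js?39ed89dd02e44899ebb4"></script>
</body></html>"""


class TagCounter(HTMLParser):
    """Count <table> tags and collect the src of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.table_count = 0
        self.script_sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_count += 1
        elif tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_sources.append(src)


def inspect_static_html(html: str) -> TagCounter:
    """Parse the markup and return the populated counter."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter


result = inspect_static_html(STATIC_HTML)
print("tables found:", result.table_count)
print("scripts found:", result.script_sources)
```

A table count of zero next to a bundled script is the usual sign that the content is rendered client-side, which is when either the API or a real browser is needed.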
