
Scraping tables from a JavaScript webpage using Selenium, BeautifulSoup, and Pandas

First off, I'm a beginner trying to achieve something that is currently beyond my abilities. Still, I hope you can help me. Much appreciated.

I'm trying to scrape the tables from spaclens.com. I already tried the out-of-the-box solution in Google Sheets, but the site is JavaScript-based, which Google Sheets can't handle. I found some code online that I modified for my needs, but I'm stuck.

import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

# Step 1: Create a session and load the page
driver = webdriver.Chrome()
driver.get('https://www.spaclens.com/')

# Wait for the page to fully load
driver.implicitly_wait(5)

# Step 2: Parse HTML code and grab tables with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'lxml')

tables = soup.find_all('table')

# Step 3: Read tables with Pandas read_html()
dfs = pd.read_html(str(tables))

print(f'Total tables: {len(dfs)}')
print(dfs[0])

driver.close()

The code above gives me the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-34-a32c8dbcef38> in <module>
     16 
     17 # Step 3: Read tables with Pandas read_html()
---> 18 dfs = pd.read_html(str(tables))
     19 
     20 print(f'Total tables: {len(dfs)}')

~\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    294                 )
    295                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296             return func(*args, **kwargs)
    297 
    298         return wrapper

~\anaconda3\lib\site-packages\pandas\io\html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
   1084         )
   1085     validate_header_arg(header)
-> 1086     return _parse(
   1087         flavor=flavor,
   1088         io=io,

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    915             break
    916     else:
--> 917         raise retained
    918 
    919     ret = []

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    896 
    897         try:
--> 898             tables = p.parse_tables()
    899         except ValueError as caught:
    900             # if `io` is an io-like object, check if it's seekable

~\anaconda3\lib\site-packages\pandas\io\html.py in parse_tables(self)
    215         list of parsed (header, body, footer) tuples from tables.
    216         """
--> 217         tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    218         return (self._parse_thead_tbody_tfoot(table) for table in tables)
    219 

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse_tables(self, doc, match, attrs)
    545 
    546         if not tables:
--> 547             raise ValueError("No tables found")
    548 
    549         result = []

ValueError: No tables found

Do I need to change the parameters to find the tables? Can anyone shed some light on this?

Thanks!!
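For context, `pandas.read_html` raises exactly this `ValueError` whenever the HTML it parses contains no `<table>` elements, which is what happens when a page draws its grid with `<div>`s or fills it in only after the initial render. A minimal sketch reproducing both cases (the markup here is invented for illustration):

```python
from io import StringIO

import pandas as pd

# HTML as Selenium might capture it before the JavaScript grid has rendered:
# no <table> element anywhere, just an empty container div
empty_page = "<html><body><div id='grid'></div></body></html>"

# The same idea once the data actually sits in a real <table> element
table_page = "<table><tr><th>name</th></tr><tr><td>Example SPAC</td></tr></table>"

try:
    pd.read_html(StringIO(empty_page))
except ValueError as err:
    print(err)  # No tables found -- the error from the question

dfs = pd.read_html(StringIO(table_page))
print(dfs[0])  # a one-row DataFrame with a single "name" column
```

So the failure is not a wrong parameter: `driver.page_source` simply contains no `<table>` at the moment it is read.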

It's easier to grab the data straight from the source. It comes to you in a nice JSON format.

import pandas as pd
import requests

# The site's own JSON endpoint; pageSize=9999 returns every record in one call
url = 'https://www.spaclens.com/company/page'
payload = {
    'pageIndex': '1',
    'pageSize': '9999',
    'query': '{}',
    'sort': '{}'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

jsonData = requests.get(url, headers=headers, params=payload).json()
df = pd.DataFrame(jsonData['data']['items'])

Output: 846 rows × 78 columns
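From there it's ordinary DataFrame work. As a sketch (with made-up records and field names standing in for `jsonData['data']['items']`, since the real response carries 78 columns), you might keep just the columns you need and write them out:

```python
import pandas as pd

# Hypothetical records standing in for jsonData['data']['items'];
# the field names here are invented for illustration
items = [
    {"symbol": "ABCD", "name": "Example Acquisition Corp", "status": "Searching"},
    {"symbol": "WXYZ", "name": "Sample Holdings Inc", "status": "Announced"},
]

df = pd.DataFrame(items)
subset = df[["symbol", "status"]]        # keep only the columns you care about
subset.to_csv("spacs.csv", index=False)  # persist for Google Sheets, Excel, etc.
print(subset.shape)  # (2, 2)
```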

