
Scraping tables from a JavaScript webpage using Selenium, BeautifulSoup, and Pandas

First off, I'm a beginner trying to achieve something that is currently beyond my abilities. Still, I hope you can help me. Much appreciated.

I'm trying to scrape the tables from spaclens.com. I already tried the out-of-the-box solution in Google Sheets, but the site is JavaScript-based, which Google Sheets can't handle. I found some code online that I modified for my needs, but I'm stuck.

import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

# Step 1: Create a session and load the page
driver = webdriver.Chrome()
driver.get('https://www.spaclens.com/')

# Wait for the page to fully load
driver.implicitly_wait(5)

# Step 2: Parse HTML code and grab tables with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'lxml')

tables = soup.find_all('table')

# Step 3: Read tables with Pandas read_html()
dfs = pd.read_html(str(tables))

print(f'Total tables: {len(dfs)}')
print(dfs[0])

driver.close()

The code above gives me the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-34-a32c8dbcef38> in <module>
     16 
     17 # Step 3: Read tables with Pandas read_html()
---> 18 dfs = pd.read_html(str(tables))
     19 
     20 print(f'Total tables: {len(dfs)}')

~\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    294                 )
    295                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296             return func(*args, **kwargs)
    297 
    298         return wrapper

~\anaconda3\lib\site-packages\pandas\io\html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
   1084         )
   1085     validate_header_arg(header)
-> 1086     return _parse(
   1087         flavor=flavor,
   1088         io=io,

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    915             break
    916     else:
--> 917         raise retained
    918 
    919     ret = []

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    896 
    897         try:
--> 898             tables = p.parse_tables()
    899         except ValueError as caught:
    900             # if `io` is an io-like object, check if it's seekable

~\anaconda3\lib\site-packages\pandas\io\html.py in parse_tables(self)
    215         list of parsed (header, body, footer) tuples from tables.
    216         """
--> 217         tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    218         return (self._parse_thead_tbody_tfoot(table) for table in tables)
    219 

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse_tables(self, doc, match, attrs)
    545 
    546         if not tables:
--> 547             raise ValueError("No tables found")
    548 
    549         result = []

ValueError: No tables found

Do I need to change the parameters to find the tables? Can anyone shed some light on this?

Thanks!!
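For context, `pandas.read_html` raises exactly this `ValueError` whenever the HTML it parses contains no `<table>` elements, which is what happens when a page draws its grid with `<div>`s or fills it in only after the initial render. A minimal sketch reproducing both cases (the markup here is invented for illustration):

```python
from io import StringIO

import pandas as pd

# HTML as Selenium might capture it before the JavaScript grid has rendered:
# no <table> element anywhere, just an empty container div
empty_page = "<html><body><div id='grid'></div></body></html>"

# The same idea once the data actually sits in a real <table> element
table_page = "<table><tr><th>name</th></tr><tr><td>Example SPAC</td></tr></table>"

try:
    pd.read_html(StringIO(empty_page))
except ValueError as err:
    print(err)  # No tables found -- the error from the question

dfs = pd.read_html(StringIO(table_page))
print(dfs[0])  # a one-row DataFrame with a single "name" column
```

So the failure is not a wrong parameter: `driver.page_source` simply contains no `<table>` at the moment it is read.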

It's easier to grab the data straight from the source. It comes to you in a nice JSON format.

import pandas as pd
import requests

# The site's own JSON endpoint; pageSize=9999 returns every record in one call
url = 'https://www.spaclens.com/company/page'
payload = {
    'pageIndex': '1',
    'pageSize': '9999',
    'query': '{}',
    'sort': '{}'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

jsonData = requests.get(url, headers=headers, params=payload).json()
df = pd.DataFrame(jsonData['data']['items'])

Output: 846 rows × 78 columns
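From there it's ordinary DataFrame work. As a sketch (with made-up records and field names standing in for `jsonData['data']['items']`, since the real response carries 78 columns), you might keep just the columns you need and write them out:

```python
import pandas as pd

# Hypothetical records standing in for jsonData['data']['items'];
# the field names here are invented for illustration
items = [
    {"symbol": "ABCD", "name": "Example Acquisition Corp", "status": "Searching"},
    {"symbol": "WXYZ", "name": "Sample Holdings Inc", "status": "Announced"},
]

df = pd.DataFrame(items)
subset = df[["symbol", "status"]]        # keep only the columns you care about
subset.to_csv("spacs.csv", index=False)  # persist for Google Sheets, Excel, etc.
print(subset.shape)  # (2, 2)
```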

