Scraping tables from a JavaScript webpage using Selenium, BeautifulSoup, and Pandas
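First of all, I am a beginner and am trying to achieve something that is currently beyond my abilities. Still, I hope you can help me. Much appreciated.
I am trying to scrape the table from spaclens.com. I already tried the out-of-the-box solution in Google Sheets, but the site is rendered by JavaScript, which Google Sheets cannot handle. I found some code online and adapted it to my needs, but now I am stuck.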
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
# Step 1: Create a session and load the page
driver = webdriver.Chrome()
driver.get('https://www.spaclens.com/')
# Wait for the page to fully load
driver.implicitly_wait(5)
# Step 2: Parse HTML code and grab tables with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'lxml')
tables = soup.find_all('table')
# Step 3: Read tables with Pandas read_html()
dfs = pd.read_html(str(tables))
print(f'Total tables: {len(dfs)}')
print(dfs[0])
driver.close()
The code above gives me the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-a32c8dbcef38> in <module>
16
17 # Step 3: Read tables with Pandas read_html()
---> 18 dfs = pd.read_html(str(tables))
19
20 print(f'Total tables: {len(dfs)}')
~\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
294 )
295 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296 return func(*args, **kwargs)
297
298 return wrapper
~\anaconda3\lib\site-packages\pandas\io\html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
1084 )
1085 validate_header_arg(header)
-> 1086 return _parse(
1087 flavor=flavor,
1088 io=io,
~\anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
915 break
916 else:
--> 917 raise retained
918
919 ret = []
~\anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
896
897 try:
--> 898 tables = p.parse_tables()
899 except ValueError as caught:
900 # if `io` is an io-like object, check if it's seekable
~\anaconda3\lib\site-packages\pandas\io\html.py in parse_tables(self)
215 list of parsed (header, body, footer) tuples from tables.
216 """
--> 217 tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
218 return (self._parse_thead_tbody_tfoot(table) for table in tables)
219
~\anaconda3\lib\site-packages\pandas\io\html.py in _parse_tables(self, doc, match, attrs)
545
546 if not tables:
--> 547 raise ValueError("No tables found")
548
549 result = []
ValueError: No tables found
Do I need to change a parameter so that the tables are found? Can anyone shed some light on this?
Thanks!!
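The `ValueError: No tables found` in the traceback is raised when the markup handed to `pd.read_html()` contains no `<table>` element. One quick check before calling pandas is to count the `<table>` tags in `driver.page_source`. The following is a minimal sketch using only the standard library; the helper `count_tables` and the two sample HTML strings are illustrative stand-ins for the real page source, which may look different.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "table":
            self.count += 1


def count_tables(html: str) -> int:
    """Return how many <table> elements the given markup contains."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical snippets standing in for driver.page_source.
static_page = "<html><body><table><tr><td>1</td></tr></table></body></html>"
script_only_page = (
    "<html><body><div id='app'></div><script src='app.js'></script></body></html>"
)

print(count_tables(static_page))       # 1
print(count_tables(script_only_page))  # 0
```

If `count_tables(driver.page_source)` returns 0, changing the arguments of `pd.read_html()` is unlikely to help, because there is nothing for it to parse.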
It is easier to get the data straight from the source. It is served to you in a nice JSON format.
import pandas as pd
import requests
url = 'https://www.spaclens.com/company/page'
payload = {
'pageIndex': '1',
'pageSize': '9999',
'query': '{}',
'sort': '{}'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
jsonData = requests.get(url, headers=headers, params=payload).json()
df = pd.DataFrame(jsonData['data']['items'])
Output: 846 rows, 78 columns
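The lookup `jsonData['data']['items']` assumes the response always has that shape. If the endpoint changes or returns an error body, the failure would surface as a bare `KeyError` or `TypeError`. As a rough sketch, a small guard can make that failure easier to read. The helper name `extract_items` and the sample payload (including the `name` and `price` fields) are hypothetical and not taken from the real response.

```python
from typing import Any


def extract_items(payload: Any) -> list:
    """Return payload['data']['items'], or raise ValueError if the shape differs."""
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object at the top level")
    data = payload.get("data")
    if not isinstance(data, dict):
        raise ValueError("missing 'data' object in response")
    items = data.get("items")
    if not isinstance(items, list):
        raise ValueError("missing 'items' list in response")
    return items


# Hypothetical payload; the field names inside each item are made up.
sample = {
    "data": {
        "items": [
            {"name": "A", "price": 10.0},
            {"name": "B", "price": 9.9},
        ]
    }
}

items = extract_items(sample)
print(len(items))  # 2
```

With this in place, the last line of the snippet above could become `df = pd.DataFrame(extract_items(jsonData))`.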