Scraping data from a table

Question

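I first tried bs4, but the table is not plain HTML text, which is why I switched to Selenium.

I am trying to scrape the table data, but I don't know how to get at the information.

What I have so far: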

table =  browser.find_element_by_id("name_list")  
cell = table.find_elements_by_xpath("//td[@style='text-align:center']")

The table data looks like this:

<td style="text-align:center" class="left"><script   
type="text/javascript">document.write(Base64.decode("MTA0LjI0OC4xMTUuMjM2"))</script>"John"</td>

I want to get "John", but how do I do that?

Answer 1

You can do this with BeautifulSoup.

If the <td> contains a <script>, you can iterate over .children and take the second (last) element; the first element is the <script>.

from bs4 import BeautifulSoup as BS

html = '''<td style="text-align:center" class="left"><script   
type="text/javascript">document.write(Base64.decode("MTA0LjI0OC4xMTUuMjM2"))</script>"John"</td>'''

soup = BS(html, 'html.parser')
td = soup.find('td')

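# td has two children: the <script> tag, then the text node
text = list(td.children)[1]

print(text) # "John" (the quotes are part of the cell text)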

Alternatively, you can find the <script> and extract() it, so that the <td> contains only the text.

from bs4 import BeautifulSoup as BS

html = '''<td style="text-align:center" class="left"><script   
type="text/javascript">document.write(Base64.decode("MTA0LjI0OC4xMTUuMjM2"))</script>"John"</td>'''

soup = BS(html, 'html.parser')
td = soup.find('td')

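# remove the <script> from the tree; only the text node is left in td
td.find('script').extract()
text = td.text

print(text) # "John" (the quotes are part of the cell text)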

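If you need the value inside Base64.decode("MTA0LjI0OC4xMTUuMjM2"), you can find the <script> and read its content as text. Slicing gives you the string MTA0LjI0OC4xMTUuMjM2, which you can decode with the base64 module. The result is 104.248.115.236.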

from bs4 import BeautifulSoup as BS
import base64

html = '''<td style="text-align:center" class="left"><script   
type="text/javascript">document.write(Base64.decode("MTA0LjI0OC4xMTUuMjM2"))</script>"John"</td>'''

soup = BS(html, 'html.parser')
td = soup.find('td')

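script = td.find('script').text

# drop the leading 'document.write(Base64.decode("' (30 characters)
# and the trailing '"))' (3 characters)
text = script[30:-3]

text = base64.b64decode(text).decode()

print(text) # 104.248.115.236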
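
The fixed slice script[30:-3] depends on the exact length of the surrounding JavaScript, so it stops working if the page changes its spacing or wording. A minimal sketch of a more tolerant variant, assuming the script always contains a call of the form Base64.decode("...") with a standard Base64 payload. The helper name decode_base64_calls is made up for illustration, and the input is the script content as a plain string (for example the value of td.find('script').text above):

```python
import base64
import re

# Matches Base64.decode("...") or Base64.decode('...') and captures the payload.
# Assumption: the payload is standard Base64 (A-Z, a-z, 0-9, +, /, = padding).
BASE64_CALL = re.compile(r"""Base64\.decode\(\s*["']([A-Za-z0-9+/=]+)["']\s*\)""")


def decode_base64_calls(script_text):
    """Return the decoded payload of every Base64.decode(...) call found."""
    return [
        base64.b64decode(payload).decode("utf-8")
        for payload in BASE64_CALL.findall(script_text)
    ]


script_text = 'document.write(Base64.decode("MTA0LjI0OC4xMTUuMjM2"))'
print(decode_base64_calls(script_text))  # ['104.248.115.236']
```

Returning a list rather than a single string means a script with no match yields an empty list instead of raising an error.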

Answer 2

You can get the text with the following line.

table.find_element_by_xpath(".//td[@style='text-align:center']").text

Make sure the XPath starts with a ".", which limits the search to the current table node.
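
The find_element_by_xpath helpers used in this thread were removed in Selenium 4.3 in favour of find_element(by, value) and find_elements(by, value). A rough sketch of the same relative lookup in that style, assuming table is a Selenium WebElement; the function name centered_cell_texts is hypothetical, and the plain string "xpath" is the value that By.XPATH stands for:

```python
def centered_cell_texts(table):
    """Return the text of each centered <td> inside the given table element.

    `table` is assumed to be a Selenium WebElement, e.g. the result of
    browser.find_element("id", "name_list").
    """
    # The leading "." makes the XPath relative to `table`; a bare "//td..."
    # would search the whole document even when called on an element.
    cells = table.find_elements("xpath", ".//td[@style='text-align:center']")
    return [cell.text for cell in cells]
```

With a live browser session this would be called as centered_cell_texts(browser.find_element("id", "name_list")).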

2 answers

Answer 1
Score 1, accepted, 2019-07-12 21:21:43

Answer 2
Score 0, 2019-07-12 21:14:12
