
bs4 python not finding text

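I have an HTML page that I am scraping with Beautiful Soup. An excerpt of the HTML is at the bottom of this question. I am using Beautiful Soup together with Selenium.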

The page tells me that I am only allowed to pull so much data per hour, and that the limit clears once I let it wait for a while (an hour).

This is how I am trying to extract the data:

# Imports this method appears to rely on (assumed; not shown in the original post)
from bs4 import BeautifulSoup as bs4
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

def get_page_data(self):
    opts = Options()
    opts.headless = True
    assert opts.headless  # Operating in headless mode
    browser_detail = Firefox(options=opts)
    url = self.base_url.format(str(self.tracking_id))
    print(url)
    browser_detail.get(url)
    self.page_data = bs4(browser_detail.page_source, 'html.parser')
    Error_Check = 1 if len(self.page_data.findAll(text='Error Report Number')) > 0 else 0
    Error_Check = 2 if len(self.page_data.findAll(text='exceeded the maximum number of sessions per hour allowed')) > 0 else Error_Check
    print(self.page_data.findAll(text='waiting an hour and trying your query again'))  ##<<--- The Problem is this line.
    print(self.page_data)
    return Error_Check

The problem is this line:

print(self.page_data.findAll(text='waiting an hour and trying your query again'))  ##<<--- The Problem is this line.

The code cannot find that text in the page. What am I missing? Thanks.

<html><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/CMPL/styles/ogm_style.css;jsessionid=rw9pc8-bncrIy_4KSZmJ8BxN2Z2hnKVwcr79Vho4-99gxTPrxNbo!-68716939" rel="stylesheet" type="text/css"/>
<body>
<!-- Content Area -->
<table style="width:100%; margin:auto;">
<tbody><tr valign="top">
<td class="ContentArea" style="width:100%;">
<span id="messageArea">
<!-- /tiles/messages.jsp BEGIN -->
<ul>
</ul><b>
</b><table style="width:100%; margin:auto; white-space: pre-wrap; text-align: left;">
<tbody><tr><td align="left"><b><li><font color="red"></font></li></b></td>
<td align="left"><font color="red">You have exceeded the maximum number of sessions per hour allowed for the public queries. You may still access the public</font></td>
</tr>
<tr><td><font color="red"><li style="list-style: none;"></li></font></td>
<td align="left"><font color="red">queries by waiting an hour and trying your query again. The RRC public queries are provided to facilitate online research and are not intended to be accessed by automated tools or scripts. For questions or concerns please contact the RRC HelpDesk at helpdesk@rrc.state.tx.us or 512-463-7229</font></td>
</tr>
</tbody></table>
<p>....more html...</p>
</body></html>

You can use the following CSS selector:

tr:last-child:not([valign])

from bs4 import BeautifulSoup as bs
html = '''yourHTML'''    
soup = bs(html, 'lxml')   
item = soup.select_one('tr:last-child:not([valign])')
print(item.text)

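If this returns more than one item, you can loop over the list and keep only the items that contain the string of interest. You could also restrict the selector to just td and do something similar.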

items = soup.select('tr:last-child:not([valign])')
for item in items:
    if 'queries by waiting an hour' in item.text:
        print(item.text)

BeautifulSoup 4.7.1
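
As a rough sketch of the "restrict the selector to just td" suggestion above: the snippet below selects every td and keeps the ones whose text contains the phrase. The html string is a trimmed copy of the excerpt from the question (the second message is shortened), and the names needle and matches are only illustrative. It uses 'html.parser' so that no extra parser has to be installed; 'lxml' should behave the same way for this fragment.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the excerpt from the question (second message shortened).
html = """
<table style="width:100%; margin:auto; white-space: pre-wrap; text-align: left;">
<tbody><tr><td align="left"><b><li><font color="red"></font></li></b></td>
<td align="left"><font color="red">You have exceeded the maximum number of sessions per hour allowed for the public queries. You may still access the public</font></td>
</tr>
<tr><td><font color="red"><li style="list-style: none;"></li></font></td>
<td align="left"><font color="red">queries by waiting an hour and trying your query again. The RRC public queries are provided to facilitate online research.</font></td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Substring of interest; a plain "in" test works on partial text.
needle = "queries by waiting an hour"

# Keep the text of every cell that contains the substring.
matches = [
    td.get_text(strip=True)
    for td in soup.select("td")
    if needle in td.get_text()
]

for text in matches:
    print(text)
```

Filtering in Python rather than in the selector keeps the CSS simple, at the cost of visiting every td on the page.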

I'm not sure this is what you are looking for, but if it is:

html = '''yourHTML'''  # placeholder: the HTML excerpt from the question
from bs4 import BeautifulSoup as bs4
soup = bs4(html, 'lxml')
data = soup.find_all('font', color="red")
data[3].text

Output:

'queries by waiting an hour and trying your query again. The RRC public queries are provided to facilitate online research and are not intended to be accessed by automated tools or scripts. For questions or concerns please contact the RRC HelpDesk at helpdesk@rrc.state.tx.us or 512-463-7229'
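
Neither answer spells out why the original call comes back empty, so here is a hedged sketch of the likely cause. When find_all is given a plain string for its text/string argument, Beautiful Soup compares it with each whole text node, and the phrase in the question is only part of a longer node. Passing a compiled regular expression instead makes it use search(), which matches inside the node. The helper check_errors is hypothetical; it only mirrors the 0/1/2 codes from the question using substring tests on the page text.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the excerpt from the question (second message shortened).
html = """
<table>
<tbody><tr><td align="left"><font color="red">You have exceeded the maximum number of sessions per hour allowed for the public queries. You may still access the public</font></td>
</tr>
<tr><td align="left"><font color="red">queries by waiting an hour and trying your query again. The RRC public queries are provided to facilitate online research.</font></td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
phrase = "waiting an hour and trying your query again"

# Plain string: compared against each complete text node, so a partial
# phrase does not match the longer sentence that contains it.
exact = soup.find_all(string=phrase)

# Compiled pattern: applied with search(), so it matches within a node.
partial = soup.find_all(string=re.compile(re.escape(phrase)))

print(len(exact), len(partial))


def check_errors(soup):
    """Hypothetical helper mirroring the question's Error_Check codes."""
    page_text = soup.get_text()
    # The session-limit message takes priority, as in the original code.
    if "exceeded the maximum number of sessions per hour allowed" in page_text:
        return 2
    if "Error Report Number" in page_text:
        return 1
    return 0


print(check_errors(soup))
```

The same reasoning would apply to the two Error_Check lines in the question, since both pass partial phrases as plain strings.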

