BeautifulSoup: how do I get the parent element tag that contains specific text? Trying to scrape emails but unable to pick up the parent element tag

Question

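I am trying to scrape email addresses from a page and am having trouble getting the parent element that contains the email's '@' symbol. The emails are embedded in different element tags, so I can't single them out. I have to go through roughly 50,000 pages.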

url = 'https://sec.report/Document/0001078782-20-000134/#f10k123119_ex10z22.htm'

Here are some examples (a couple from the different pages I have to scrape):

<div style="border-bottom:1px solid #000000">**dbrenner@umich.edu**</div>

<div class="f3c-8"><u**>Bob@LifeSciAdvisors.com**</u></div>

<p style="margin-bottom:0pt;margin-top:0pt;;text-indent:0pt;;font-family:Arial;font-size:11pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Email: **dmoskowitz@biocept.com**; Phone: 858-320-8244</p>

<td class="f8c-43">E-mail: <u>jcohen@2020gene.com</u></td>

<p class="f7c-4">Email: jcohen@2020gene.com</p>

What I've tried:

I tried find_all('div') to get a ResultSet of all divs and then pick out the ones that contain the '@' symbol.

div = page.find_all('div')
for each in div:
    if '@' in each.text: 
        print(each.text)

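When I did this, it printed the whole page, because the body is inside a 'div'. Fail. Since the emails are embedded in different tags, this approach also seems inefficient.

Using regex. I tried using a regular expression to pick out the emails, but it returns a bunch of unusable text that I would then have to split, replace characters in, and so on by hand. That seems like a daunting task to go through for all the different scenarios.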

    import re
    # Raw string so that \S is passed to the regex engine unchanged
    emails = re.findall(r'\S+@\S+', str(page))
    for each in emails:
        print(each)

Doing that gave me something like this:

hidden;}@media
#000000">dbrenner@umich.edu</div>
#000000">kherman@umich.edu
#000000">spage@fredhutch.org</div>
#000000">mtuck@umich.edu</div>
#000000">jdahlgre@fredhutch.org</div></p>
#000000">lafky.jacqueline@mayo.edu</div></p>
mtuck@umich.edu)</div>
#000000">ctsucontact@westat.com</div>.
href="http://@umich.edu">@umich.edu</a></li><li><a

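Now I could go through and split some of the text with .split('<'), then split again, and so on, but the results are not all alike. Since I have to scrape 50,000+ pages with around 100 entries per page, there is a lot I would have to scrape and account for.

I tried looking on Google and Stack Overflow, but all I could find were solutions for people looking for text inside a known element, and so on.

What I need is: how do I find the parent element that contains an email?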

I don't think I need Selenium for this, since the issue would be the same as with BeautifulSoup and the site is not JavaScript-rendered. Some of the pages are PDFs, but that is a whole other issue.

Any insight, help, or advice is appreciated. Thank you.

Answer 1

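There are two options for searching for text that contains the @ symbol:

Use the CSS selector :contains(<MY TEXT>) to search for text containing the @ symbol.
Use a lambda function in the find_all() method and check whether @ is in the tag's .text.

Option 1: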

from bs4 import BeautifulSoup


html = """<div style="border-bottom:1px solid #000000">**dbrenner@umich.edu**</div>

<div class="f3c-8"><u**>Bob@LifeSciAdvisors.com**</u></div>

<p style="margin-bottom:0pt;margin-top:0pt;;text-indent:0pt;;font-family:Arial;font-size:11pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Email: **dmoskowitz@biocept.com**; Phone: 858-320-8244</p>

<td class="f8c-43">E-mail: <u>jcohen@2020gene.com</u></td>

<p class="f7c-4">Email: jcohen@2020gene.com</p>"""

soup = BeautifulSoup(html, "html.parser")

for tag in soup.select('*:contains("@")'):
    print(tag.text.strip())
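A note on Option 1, as a sketch rather than a definitive recipe: in newer releases of soupsieve (the selector library behind `select()`), `:contains()` is deprecated in favour of `:-soup-contains()`, and there is also `:-soup-contains-own()`, which only checks text sitting directly inside an element. Assuming a soupsieve version that provides both, the difference looks like this. The `wrapper` div and the trimmed-down HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: an outer wrapper div around two of the question's snippets
html = """<div id="wrapper">
<div class="f3c-8"><u>Bob@LifeSciAdvisors.com</u></div>
<p class="f7c-4">Email: jcohen@2020gene.com</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Matches every element with "@" anywhere in its text, ancestors included
all_matches = [t.name for t in soup.select('*:-soup-contains("@")')]

# Matches only elements whose own direct text nodes contain "@"
own_matches = [t.name for t in soup.select('*:-soup-contains-own("@")')]

print(all_matches)
print(own_matches)
```

The second selector skips the outer containers, which is likely what you want when the whole page body sits inside one big div.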

Option 2:

for tag in soup.find_all(lambda t: "@" in t.text.strip()):
    print(tag.text.strip())
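Both options above also match every ancestor of the email (including the outer div that wraps the page), which is the problem described in the question. One possible way around that, offered as a minimal sketch, is to search the text nodes themselves with `find_all(string=...)` and then read each node's `.parent`. The `EMAIL_RE` pattern, the `wrapper` div, and the `results` list are assumptions for illustration. The pattern is deliberately simple and will not cover every valid address, and an address split across several tags would be missed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample built from the snippets in the question
html = """<div id="wrapper">
<div style="border-bottom:1px solid #000000">dbrenner@umich.edu</div>
<div class="f3c-8"><u>Bob@LifeSciAdvisors.com</u></div>
<p class="f7c-4">Email: dmoskowitz@biocept.com; Phone: 858-320-8244</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Simplified email pattern (an assumption, not a full RFC 5322 matcher)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

results = []
# string=... matches text nodes, so each hit is the innermost piece of text
for text_node in soup.find_all(string=EMAIL_RE):
    parent = text_node.parent  # the tag that directly holds this text
    for email in EMAIL_RE.findall(text_node):
        results.append((parent.name, email))

for tag_name, email in results:
    print(tag_name, email)
```

Because the regex runs on the text node rather than on `str(page)`, no markup such as `#000000">` or `</div>` leaks into the matches, and fragments like `@media` are ignored since they have no local part or domain.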
