
Python, BeautifulSoup, IMAP and Requests

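I'm writing a program that scrapes emails and retrieves links from specific emails (such as those from Newegg). I can log in over IMAP and get the HTML, but whenever I actually locate the HTML in BeautifulSoup, the href isn't the actual link. The code I'm using is below.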

for part in email_message.walk():
    # Skip multipart containers; only the leaf parts carry content
    if part.get_content_maintype() == "multipart":
        continue
    filename = part.get_filename()
    if not filename:
        # Unnamed part: generate a placeholder file name
        ext = '.html'
        filename = 'msg-part-%08d%s' % (counter, ext)
        counter += 1
    content_type = part.get_content_type()
    print(content_type)
    if "html" in content_type:
        html_ = part.get_payload()
        soup = BeautifulSoup(html_, 'html.parser')
        print(soup.prettify())

I found the specific "a" element I need, but its contents are completely scrambled.

<a 003e40bf801121da0c8026371fac2881d7ada96d0f3044bd52ea90c6f133482b7ac302="A6D66F63B93C9324A61A7BF6F53C6AD3B9FCE2E9722F2737F5C2BDEEC024FDB8BB1A8A=" 2='55);"' 255,="" 300px;="" 6035c14bccf6122e13d0c10420388428d9f6da5e55d8c39a946196846f58cb49754b4b="624023C5DA52531A16144B86B446AD159FC1B545111C71DA86A4B2BAB81E3BC6809F33=" a1c0a2d2231e3aa2c54f352a6ead95f6a0d3cad4f9d4610f9475ac"="" block;="" c41a281473fd50144612e9e4807a3c31920713f3fa05e6e851e830e5aaed502d2057db="5696E077EDBD8839ED7D1D9DFE6BCF861A354D0E6FB52F0818DB4C7971AB53C7F73CBB=" color:="" decoration:="" display:="" hre='f=3D"https://www.newegg.com/mr/DAC2DEFCF7F955FE4BCD7E3DF4926A74/D1F70B=' none;="" rgb(255,="" style='3D"text-=' target='3D"_blank"' width:="">
Verify My Email
<img 2018="" 6px;"="" :="" border='3D"0"' icon_arrow.='png"' images="" neemail="" promotions.newegg.com="" src='3D"https=' style='3D"width:' transactional="" width='3D"6"'/>
</a>

If I pull it up with Inspect Element, this is what the actual HTML looks like.

<a href="https://www.newegg.com/mr/DAC2DEFCF7F955FE4BCD7E3DF4926A74/D1F70BC41A281473FD50144612E9E4807A3C31920713F3FA05E6E851E830E5AAED502D2057DB5696E077EDBD8839ED7D1D9DFE6BCF861A354D0E6FB52F0818DB4C7971AB53C7F73CBB003E40BF801121DA0C8026371FAC2881D7ADA96D0F3044BD52EA90C6F133482B7AC302A6D66F63B93C9324A61A7BF6F53C6AD3B9FCE2E9722F2737F5C2BDEEC024FDB8BB1A8A6035C14BCCF6122E13D0C10420388428D9F6DA5E55D8C39A946196846F58CB49754B4B624023C5DA52531A16144B86B446AD159FC1B545111C71DA86A4B2BAB81E3BC6809F33A1C0A2D2231E3AA2C54F352A6EAD95F6A0D3CAD4F9D4610F9475AC" style="" target="_blank" data-saferedirecturl="https://www.google.com/url?q=https://www.newegg.com/mr/DAC2DEFCF7F955FE4BCD7E3DF4926A74/D1F70BC41A281473FD50144612E9E4807A3C31920713F3FA05E6E851E830E5AAED502D2057DB5696E077EDBD8839ED7D1D9DFE6BCF861A354D0E6FB52F0818DB4C7971AB53C7F73CBB003E40BF801121DA0C8026371FAC2881D7ADA96D0F3044BD52EA90C6F133482B7AC302A6D66F63B93C9324A61A7BF6F53C6AD3B9FCE2E9722F2737F5C2BDEEC024FDB8BB1A8A6035C14BCCF6122E13D0C10420388428D9F6DA5E55D8C39A946196846F58CB49754B4B624023C5DA52531A16144B86B446AD159FC1B545111C71DA86A4B2BAB81E3BC6809F33A1C0A2D2231E3AA2C54F352A6EAD95F6A0D3CAD4F9D4610F9475AC&amp;source=gmail&amp;ust=1641074397136000&amp;usg=AOvVaw0KGbJN7dTvHZEejpneFiwb" xpath="1">
Verify My Email
<img border="0" src="https://ci5.googleusercontent.com/proxy/_AqrkZlchXMl0NMIJmmpeVH9ePljKIfp9UFOSxeynkr2vxupXQV1LDQqi3y8DDhGWlAfsZ0hZ8VKMRBhrnDFeeqT6rZsn3ypYcFKSEgx_gSfxIUYFJJfeS1JHxetYJiOYA=s0-d-e1-ft#https://promotions.newegg.com/NEemail/transactional/images/2018/icon_arrow.png" style="width:6px" width="6" class="CToWUd">
</a>

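What exactly am I doing wrong? Why isn't my href coming out as the actual Newegg link? I could take the pieces BeautifulSoup gives me and unscramble them, but that isn't what I want to do.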

I need to be able to get the link so that I can run a request against it.

Figured it out. Change this:

html_ = part.get_payload()

to this:

html_ = part.get_payload(decode=True)
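
A likely reason this works: the `=3D` sequences and the `=` at the ends of lines in the scrambled output are typical of the quoted-printable transfer encoding, and `get_payload(decode=True)` undoes that encoding (returning bytes) before the HTML reaches the parser. Below is a minimal sketch of the idea, not the asker's actual program. The sample message, the `example.com` URL, and the names `RAW`, `LinkCollector` and `extract_links` are made up for illustration, and the standard library's `HTMLParser` stands in for BeautifulSoup so the snippet runs without extra packages.

```python
import email
from html.parser import HTMLParser

# Hypothetical message: an HTML part sent as quoted-printable, so "=" is
# written as "=3D" and long lines are wrapped with a trailing "=".
RAW = (
    b"MIME-Version: 1.0\n"
    b'Content-Type: text/html; charset="utf-8"\n'
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"<a hre=\n"
    b'f=3D"https://www.example.com/mr/ABC=\n'
    b'DEF" target=3D"_blank">Verify My Email</a>\n'
)


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(raw_message):
    """Return the hrefs found in the text/html parts of a raw email."""
    msg = email.message_from_bytes(raw_message)
    links = []
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        # decode=True reverses the Content-Transfer-Encoding and returns bytes
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Fall back to UTF-8 when the part declares no charset (an assumption)
        charset = part.get_content_charset() or "utf-8"
        html = payload.decode(charset, errors="replace")
        collector = LinkCollector()
        collector.feed(html)
        links.extend(collector.links)
    return links


if __name__ == "__main__":
    print(extract_links(RAW))
```

The decoded string (or the bytes themselves) can just as well be handed to `BeautifulSoup(html, 'html.parser')` as in the question; once the href is intact, it can be passed on to Requests.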

