
How do I replace all URLs in HTML with their final redirect?

Preferably using BeautifulSoup, since I am already using it for other purposes, but any Python solution is fine.

    s = BeautifulSoup(bodyhtml, features="lxml")
    items = s.find_all("div", {"class": "text-block"})
    # I want to replace all URLs in `items` with their final redirect.

Here is a sample URL:

https://tracking.tldrnewsletter.com/CL0/https:%2F%2Farstechnica.com%2Finformation-technology%2F2020%2F04%2Fmeet-dark_nexus-quite-possibly-the-most-potent-iot-botnet-ever%2F/1/0100017163ab9f84-cfdbd3c3-ef8c-4b34-b2a0-f6f4b8f78359-000000/BEB0JUmMqamX4piPthkn_oJ78cjvd6UocEmGf7iO5Pk=136

Here is items[5] (all items are alike):

<div class="text-block"><span style="color: rgb(0, 0, 0);"><a href="https://tracking.tldrnewsletter.com/CL0/https:%2F%2Fwww.polygon.com%2F2020%2F4%2F8%2F21213551%2Fgoogle-stadia-free-pro-subscription/1/010001715e86638d-8bd389c9-f9eb-4b68-ade4-c2d706ea5ecb-000000/J3pqLEKSYUvxNOcq8090EHiTSXXHiZtRNM6JD1aQP8s=136"><span style="font-size: 14px;"><strong>Google Stadia now free to anyone with a Gmail address (2 minute read)</strong></span></a><br/><br/><span style='font-size: 14px; font-family: "Helvetica Neue", Helvetica, Arial, Verdana, sans-serif;'>Google Stadia is now free to anyone with a Gmail address. New users will receive two months of Stadia Pro for free. Existing Stadia Pro users won't be charged for the next two months. Nine games are included with the offer. Access to Stadia previously required purchasing a Google Stadia Premier Edition bundle for $129. Stadia Pro will cost $9.99 a month after the two-month trial period ends. Users can cancel their subscriptions online at any time.</span><br/></span><br/></div>

Select the relevant a elements. Assuming every link shares the same tracking prefix, strip that prefix from the href attribute, keep only the text before the first remaining /, and then percent-decode what is left:

from bs4 import BeautifulSoup
from urllib.parse import unquote


html = """
<head>

    <body>
        <p>
            <div class="text-block"><span style="color: rgb(0, 0, 0);"><a href="https://tracking.tldrnewsletter.com/CL0/https:%2F%2Fwww.polygon.com%2F2020%2F4%2F8%2F21213551%2Fgoogle-stadia-free-pro-subscription/1/010001715e86638d-8bd389c9-f9eb-4b68-ade4-c2d706ea5ecb-000000/J3pqLEKSYUvxNOcq8090EHiTSXXHiZtRNM6JD1aQP8s=136"><span style="font-size: 14px;"><strong>Google Stadia now free to anyone with a Gmail address (2 minute read)</strong></span></a>
                <br/>
                <br/><span style='font-size: 14px; font-family: "Helvetica Neue", Helvetica, Arial, Verdana, sans-serif;'>Google Stadia is now free to anyone with a Gmail address. New users will receive two months of Stadia Pro for free. Existing Stadia Pro users won't be charged for the next two months. Nine games are included with the offer. Access to Stadia previously required purchasing a Google Stadia Premier Edition bundle for $129. Stadia Pro will cost $9.99 a month after the two-month trial period ends. Users can cancel their subscriptions online at any time.</span>
                <br/>
                </span>
                <br/>
            </div>
        </p>

        </body>
</head>
"""

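s = BeautifulSoup(html, features="lxml")
for a in s.select('div.text-block a'):
    # Strip the tracking prefix, keep the segment before the next "/",
    # then percent-decode it to recover the target URL.
    a['href'] = unquote(a['href'].replace("https://tracking.tldrnewsletter.com/CL0/", "").split('/')[0])
print(s)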

Output:

<html><head>
</head><body>
<p>
</p><div class="text-block"><span style="color: rgb(0, 0, 0);"><a href="https://www.polygon.com/2020/4/8/21213551/google-stadia-free-pro-subscription"><span style="font-size: 14px;"><strong>Google Stadia now free to anyone with a Gmail address (2 minute read)</strong></span></a>
<br/>
<br/><span style='font-size: 14px; font-family: "Helvetica Neue", Helvetica, Arial, Verdana, sans-serif;'>Google Stadia is now free to anyone with a Gmail address. New users will receive two months of Stadia Pro for free. Existing Stadia Pro users won't be charged for the next two months. Nine games are included with the offer. Access to Stadia previously required purchasing a Google Stadia Premier Edition bundle for $129. Stadia Pro will cost $9.99 a month after the two-month trial period ends. Users can cancel their subscriptions online at any time.</span>
<br/>
</span>
<br/>
</div>
</body>
</html>
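
The one-liner above assumes every href starts with exactly the same prefix. As a slightly more defensive sketch of the same idea, the decoding step could be pulled into a helper that leaves non-tracking links untouched. The name extract_target and the two constants are hypothetical, and the link layout (tracker host, then /CL0/, then the percent-encoded target, then extra segments) is an assumption based only on the sample URLs in the question:

```python
from urllib.parse import unquote, urlsplit

# Assumption: tracking links look like
#   https://<tracker host>/CL0/<percent-encoded target>/<extra segments...>
TRACKING_HOST = "tracking.tldrnewsletter.com"
TRACKING_MARKER = "/CL0/"


def extract_target(href: str) -> str:
    """Return the decoded target of a tracking link, or href unchanged."""
    parts = urlsplit(href)
    if parts.netloc != TRACKING_HOST or not parts.path.startswith(TRACKING_MARKER):
        return href  # not a tracking link: leave it alone
    # The target's own slashes are percent-encoded (%2F), so the first
    # literal "/" after the marker ends the encoded target.
    encoded = parts.path[len(TRACKING_MARKER):].split("/", 1)[0]
    return unquote(encoded)
```

Inside the loop from the answer, this would be used as a['href'] = extract_target(a['href']). Note that, like the answer, this only decodes the URL embedded in the link and never contacts the server. If the decoded target itself redirects somewhere else, finding the true final URL would need an actual HTTP request, for example requests.head(url, allow_redirects=True).url.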
