Python BeautifulSoup replace img src

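I'm trying to parse HTML content from a website and change the href and img src values. The href is changed successfully, but the img src is not.

The value changes in the variable, but not in the HTML (post_content). The output is still: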

<p><img alt="alt text" src="https://lifehacker.ru/wp-content/uploads/2016/08/15120903sa_d2__1471520915-630x523.jpg" title="Title"/></p>

and not the expected http://site.ru ... version:

<p><img alt="alt text" src="http://site.ru/wp-content/uploads/2016/08/15120903sa_d2__1471520915-630x523.jpg" title="Title"/></p>

My code:

if "app-store" not in url:
        r = requests.get("https://lifehacker.ru/2016/08/23/kak-vybrat-trimmer/")
        soup = BeautifulSoup(r.content)

        post_content = soup.find("div", {"class": "post-content"})
        for tag in post_content():
            for attribute in ["class", "id", "style", "height", "width", "sizes"]:
                del tag[attribute]

        for a in post_content.find_all('a'):
            a['href'] = a['href'].replace("https://lifehacker.ru", "http://site.ru")

        for img in post_content.find_all('img'):
            img_urls = img['src']
            if "https:" not in img_urls:
                img_urls="http:{}".format(img_urls)
            thumb_url = img_urls.split('/')
            urllib.urlretrieve(img_urls, "/Users/kr/PycharmProjects/education_py/{}/{}".format(folder_name, thumb_url[-1]))

            file_url = "/Users/kr/PycharmProjects/education_py/{}/{}".format(folder_name, thumb_url[-1])
            data = {
                'name': '{}'.format(thumb_url[-1]),
                'type': 'image/jpeg',
            }

            with open(file_url, 'rb') as img:
                data['bits'] = xmlrpc_client.Binary(img.read())


            response = client.call(media.UploadFile(data))

            attachment_url = response['url']


            img_urls = img_urls.replace(img_urls, attachment_url)



        [s.extract() for s in post_content('script')]
        post_content_insert = bleach.clean(post_content)
        post_content_insert = post_content_insert.replace('&lt;', '<')
        post_content_insert = post_content_insert.replace('&gt;', '>')

        print post_content_insert

It looks like you never assign img_urls back to img['src']. Try doing that at the end of the block:

img_urls = img_urls.replace(img_urls, attachment_url)
img['src'] = img_urls

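...but first, you need to change your with statement so that it uses a name other than img for the file object. Right now the file handle shadows the DOM element, so you can no longer access the tag after that point.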

        with open(file_url, 'rb') as some_file:
            data['bits'] = xmlrpc_client.Binary(some_file.read())
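
Putting both changes together, here is a minimal sketch of the corrected loop. It is written for Python 3 and is not the asker's full script. The HTML snippet, the shortened file name a.jpg, and the fake_upload helper are assumptions made up for illustration. fake_upload stands in for client.call(media.UploadFile(data)), and io.BytesIO stands in for opening the downloaded file, so the sketch runs without a network connection or a WordPress site.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page is fetched with requests in the question.
html = (
    '<div class="post-content"><p>'
    '<img alt="alt text" src="https://lifehacker.ru/wp-content/uploads/a.jpg"/>'
    '</p></div>'
)
soup = BeautifulSoup(html, "html.parser")
post_content = soup.find("div", {"class": "post-content"})


def fake_upload(name, payload):
    """Hypothetical stand-in for client.call(media.UploadFile(data))."""
    return {"url": "http://site.ru/wp-content/uploads/" + name}


for img in post_content.find_all("img"):
    name = img["src"].split("/")[-1]

    # The file handle gets its own name, so `img` still refers to the tag.
    # io.BytesIO replaces open(file_url, 'rb') to keep the sketch self-contained.
    with io.BytesIO(b"fake image bytes") as image_file:
        response = fake_upload(name, image_file.read())

    # Rebinding a local string never touches the tree; assign to the tag itself.
    img["src"] = response["url"]

result = str(post_content)
print(result)
```

The key point is that img['src'] returns a plain string. Calling replace on it, or rebinding img_urls, produces a new string and leaves the parsed tree unchanged until the value is written back to the tag.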

