Beautiful Soup fails to remove all script tags

Question

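I am playing around with bs4 and trying to scrape the following website: https://pythonbasics.org/selenium-get-html/ I want to remove all script tags from the HTML.

To remove the script tags, I used the following: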

for script in soup("script"):
     script.decompose()

or

[s.extract() for s in soup.findAll('script')]

as well as many other variants I found online. They all serve the same purpose, yet they fail to remove script tags such as:

<script src="/lib/jquery.js"></script>
<script src="/lib/waves.js"></script>
<script src="/lib/jquery-ui.js"></script>
<script src="/lib/jquery.tocify.js"></script>

<script src="/js/main.js"></script>
<script src="/lib/toc.js"></script>

or

<div id="disqus_thread"></div>
    <script>
        /**
         *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
         *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables
         */
        
        var disqus_config = function () {
            this.page.url = 'https://pythonbasics.org/selenium-get-html/';  // Replace PAGE_URL with your page's canonical URL variable
            this.page.identifier = '_posts/selenium-get-html.md'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
        };
        
        (function() {  // DON'T EDIT BELOW THIS LINE
            var d = document, s = d.createElement('script');
            
            s.src = '//https-pythonbasics-org.disqus.com/embed.js';
            
            s.setAttribute('data-timestamp', +new Date());
            (d.head || d.body).appendChild(s);
        })();
    </script>
    <noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript" rel="nofollow">comments powered by Disqus.</a></noscript>

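What is going on here? I found some related questions:

beautifulsoup remove all the internal javascript

BeatifulSoup4 get_text still has javascript

But the answers recommend the same approaches I already used to strip these scripts, and those failed. Other people in the comments are stuck in the same way.

I also looked at the functions nltk used to provide for this, but they no longer seem to work. Do you have any ideas? Why do these functions fail to remove all script tags? What can we do without regular expressions?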

Answer 1

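This happens because some of the <script> tags sit inside HTML comments ( <!-- ... --> ). Beautiful Soup parses a comment as a single Comment string rather than as tags, so a search for script tags never reaches them.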
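
To see the effect on a small scale, here is a minimal sketch. The HTML snippet is made up for illustration (it is not the actual page source); it only assumes that bs4 is installed and that the built-in "html.parser" is used.

```python
from bs4 import BeautifulSoup, Comment

# Made-up HTML for illustration: one live script and one commented-out script.
html = """
<p>Hello</p>
<script src="/js/main.js"></script>
<!-- <script src="/lib/toc.js"></script> -->
"""

soup = BeautifulSoup(html, "html.parser")

# Only the live <script> is parsed as a tag.
script_tags = soup.find_all("script")
print(len(script_tags))

# The commented-out script is one Comment string node, not a tag.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(len(comments))

# Removing the script tags leaves the comment, and its text, in place.
for tag in script_tags:
    tag.decompose()

remaining = str(soup)
print("/lib/toc.js" in remaining)
```

After the loop, the live script is gone but the commented-out one is still present in the serialized output, which matches the behaviour described in the question.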

You can extract these HTML comments by checking whether a text node is of type Comment:

from bs4 import BeautifulSoup, Comment

# `html` holds the page source as a string
soup = BeautifulSoup(html, "html.parser")

# Find all comments in the page and remove them; most of them contain `script` tags
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

# Find all remaining `script` tags and remove them
for tag in soup.find_all("script"):
    tag.decompose()

print(soup.prettify())
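
If you want to reuse this, the two steps could be wrapped in a helper. This is only a sketch: the function name `strip_scripts` and the sample markup are hypothetical, not part of bs4 or of the page in the question.

```python
from bs4 import BeautifulSoup, Comment


def strip_scripts(html: str) -> str:
    """Return html without <script> tags and without HTML comments."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove comments first so commented-out scripts go with them.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # Then remove the script tags that were parsed as real tags.
    for tag in soup.find_all("script"):
        tag.decompose()
    return str(soup)


# Hypothetical sample: a live inline script plus a commented-out one.
sample = (
    '<div id="disqus_thread"></div>'
    "<script>var d = document;</script>"
    '<!-- <script src="/lib/jquery.js"></script> -->'
    "<p>kept</p>"
)
cleaned = strip_scripts(sample)
print(cleaned)
```

Note that this drops every comment, not only those containing scripts. If you need to keep ordinary comments, you could filter on the comment text before calling extract().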

Solution 1 (accepted, 2021-05-25 18:38:40)
