
Delete dynamic elements from HTML with Selenium and Python

I've used BeautifulSoup to find a specific div class in the page's HTML. I want to check whether this div has a span with a particular class inside it. If the div contains that span, I want to keep it in the page's code; if it doesn't, I want to delete it, maybe using Selenium.

For that, I have two lists selecting the elements (div and span). I tried checking whether the items of one list are inside the other, and that kind of worked. But how can I delete the found element from the page's source code?

Edit

I've edited the code after a few conversations in the comments section. With help, I was able to implement code that removes elements by executing JavaScript.

The code runs with no errors, but nothing is deleted from the page.

# Import required modules
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time

# Option to launch browser in incognito
options = Options()
options.add_argument("--incognito")
#options.add_argument("--headless")

# Using chrome driver (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Web page url request
driver.get('https://www.facebook.com/ads/library/?active_status=all&ad_type=all&country=BR&q=frete%20gr%C3%A1tis%20aproveite&sort_data[direction]=desc&sort_data[mode]=relevancy_monthly_grouped&search_type=keyword_unordered&media_type=all')
driver.maximize_window()
time.sleep(10)

driver.execute_script(r"""
  for(let div of document.querySelectorAll('div._99s5')){
    let match = div.innerText.match(/(\d+) ads? use this creative and text/)
    let numAds = match ? parseInt(match[1]) : 0
    if(numAds < 10){
      div.querySelector(".tp-logo")?.remove()
    }
  }
""")

Since you're deleting them in JavaScript anyway:

driver.execute_script(r"""
  for(let div of document.querySelectorAll('div._99s5')){
    let match = div.innerText.match(/(\d+) ads? use this creative and text/)
    let numAds = match ? parseInt(match[1]) : 0
    if(numAds < 10){
      div.querySelector(".tp-logo")?.remove()
    }
  }
""")
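The script above decides per card by reading a number out of the card's text and comparing it with 10. As a rough way to check that rule against sample strings without opening a browser, here is a minimal Python sketch of the same test. The function names `parse_ad_count` and `is_below_threshold` and the sample strings are hypothetical; only the regex and the threshold come from the script.

```python
import re

# Same pattern as the JavaScript regex in the script; re.ASCII keeps \d
# limited to 0-9, as it is in JavaScript.
AD_COUNT = re.compile(r"(\d+) ads? use this creative and text", re.ASCII)


def parse_ad_count(text: str) -> int:
    """Return the ad count reported in a card's text, or 0 if none is found."""
    match = AD_COUNT.search(text)
    return int(match.group(1)) if match else 0


def is_below_threshold(text: str, threshold: int = 10) -> bool:
    """Mirror the script's `numAds < 10` condition."""
    return parse_ad_count(text) < threshold


# Hypothetical card texts, not taken from the real page.
samples = [
    "12 ads use this creative and text",
    "1 ad use this creative and text",
    "No counter in this card",
]

for text in samples:
    print(parse_ad_count(text), is_below_threshold(text), text)
```

If every real card's text comes back as 0 here, the regex is probably not matching the wording the page actually uses, which would be one thing to rule out when nothing appears to change.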

Note: The question and comments are a bit confusing, so it would be great to improve them. I am assuming you want to decompose() some elements; why, or what should happen after that, is not clear. So this answer will only point out an approach.

To decompose() the elements that do not contain ads use this creative and text, just negate your selection and iterate the ResultSet:

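# Negate on the div itself: wrapping :not() in :has() would also match any
# div that merely has one descendant without the text.
for e in soup.select('div._99s5:not(:-soup-contains("ads use this creative and text"))'):
    e.decompose()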

Now these elements will no longer be included in your soup, and you can process it as needed.
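To make the effect of decompose() concrete, here is a small self-contained sketch run on made-up markup. The HTML, the card labels, and the variable names are assumptions for illustration; only the class name `_99s5` and the marker text come from this thread. The sketch applies `:not(:-soup-contains(...))` directly to the div, so a card is selected only when none of its text contains the marker.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div id="root">
  <div class="_99s5"><span>12 ads use this creative and text</span><p>Card A</p></div>
  <div class="_99s5"><p>Card B</p></div>
  <div class="_99s5"><span>3 ads use this creative and text</span><p>Card C</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the cards whose text does NOT contain the marker and drop them
# from the parsed tree.
selector = 'div._99s5:not(:-soup-contains("ads use this creative and text"))'
for card in soup.select(selector):
    card.decompose()

# Collect the label of every card that is still in the soup.
remaining = [card.p.get_text() for card in soup.select("div._99s5")]
print(remaining)
```

With Selenium, the soup would typically be built from `driver.page_source` instead of a literal string. Note that decompose() only changes the parsed copy held by BeautifulSoup; the page displayed in the browser is not affected.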
