
BeautifulSoup trying to remove HTML data from list

As the title says, I am trying to strip the HTML from my printed output so that only the text and my | and - separators remain. Right now I also get the span markup and other tags that I would like to remove. This runs inside a loop over many pages, so I cannot search for the specific text on each page because it changes. The page structure stays the same, which is why I can print the same list indexes each time. What would be the easiest way to clean the output? Here is the relevant code:

        infoLink = driver.find_element_by_xpath("//a[contains(@href, '?tmpl=component&detail=true&parcel=')]").click()
        driver.switch_to.window(driver.window_handles[1])
        aInfo = driver.current_url
        data = requests.get(aInfo)
        src = data.text
        soup = BeautifulSoup(src, "html.parser")
        parsed = soup.find_all("td")
        for item in parsed:
            Original = (parsed[21])
            Owner = parsed[13]
            Address = parsed[17]
            print (*Original, "|",*Owner, "-",*Address)

Example output is:

<span class="detail-text">123 Main St</span> | <span class="detail-text">Banner,Bruce</span> - <span class="detail-text">1313 Mockingbird Lane<br>Santa Monica, CA  90405</br></span>

Thank you!

To get just the text between the tags, use get_text(). Make sure the cells you index are always present on the page, otherwise the lookup will raise an error:

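# No loop is needed: the three cells are picked by fixed index.
Original = parsed[21].get_text(strip=True)
Owner = parsed[13].get_text(strip=True)
# A separator keeps the two address lines apart where the <br> was.
Address = parsed[17].get_text(" ", strip=True)
print(Original, "|", Owner, "-", Address)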
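The point about avoiding errors when a cell is missing could be handled with a small guard. The following is only a sketch: `cell_text` and its `default` parameter are hypothetical names, not part of BeautifulSoup, and it assumes `cells` is the list returned by `soup.find_all("td")`.

```python
def cell_text(cells, index, default=""):
    """Return the stripped text of cells[index], or default if the cell is missing."""
    try:
        cell = cells[index]
    except IndexError:
        # The page had fewer <td> cells than expected
        return default
    # A single space is used as separator so text around <br> does not run together
    return cell.get_text(" ", strip=True)


# Usage with the names from the question (parsed = soup.find_all("td")):
# print(cell_text(parsed, 21), "|", cell_text(parsed, 13), "-", cell_text(parsed, 17))
```

With this, a page that has fewer cells than expected prints an empty field instead of stopping the loop with an IndexError.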

I recently wrote a function that does something like this. It won't work if your target text itself contains a < or a >, though.

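def remove_html_tags(string):
    start = string.find("<")
    end = string.find(">", start + 1)
    # Stop once no complete <...> pair is left; this also prevents
    # endless recursion on a stray "<" or ">".
    if start == -1 or end == -1:
        return string.strip()
    return remove_html_tags(string[:start] + string[end + 1:])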

It recursively removes the text between < and >, inclusive.

Let me know if this works!
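If the limitation with < and > in the text matters and BeautifulSoup is not an option, one alternative is to let the standard library's `html.parser` do the tokenizing instead of searching for brackets by hand. This is a minimal sketch, not the method from the answer above; `TextExtractor` and `strip_tags` are names made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment, dropping every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for the text found between tags
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces with a space, then collapse repeated whitespace
    return " ".join(" ".join(parser.parts).split())


print(strip_tags('<span class="detail-text">Banner,Bruce</span>'))  # Banner,Bruce
```

Joining the pieces with a space keeps text on either side of a `<br>` apart, at the cost of also inserting a space where a tag sits in the middle of a word.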
