
How to exclude nested tags from a parent tag and get the output as plain text, skipping the link (a) tags

I want to exclude the nested tags inside each paragraph. In this case, I want to ignore the a tags (links) attached to some of the words:

import requests
from bs4 import BeautifulSoup

base_url = "https://www.usatoday.com/story/tech/2020/09/17/qanon-conspiracy-theories-debunked-social-media/5791711002/"
response = requests.get(base_url)
html = response.content

# The parser name is the second positional argument (features), not parser=
bs = BeautifulSoup(html, "lxml")

article = bs.find_all("article", {"class": "gnt_pr"})
body = article[0].find_all("p", {"class": "gnt_ar_b_p"})

The output is:

[<p class="gnt_ar_b_p">An emboldened community of believers known as QAnon is spreading a baseless patchwork of conspiracy theories that are fooling Americans who are looking for simple answers in a time of intense political polarization, social isolation and economic turmoil.</p>,
 <p class="gnt_ar_b_p">Experts call QAnon <a class="gnt_ar_b_a" data-t-l="|inline|intext|n/a" href="https://www.usatoday.com/in-depth/tech/2020/08/31/qanon-conspiracy-theories-trump-election-covid-19-pandemic-extremist-groups/5662374002/" rel="noopener" target="_blank">a "digital cult"</a> because of its pseudo-religious qualities and an extreme belief system that enthrones President Donald Trump as a savior figure crusading against evil.</p>,
 <p class="gnt_ar_b_p">The core of QAnon is the false theory that Trump was elected to root out a secret child-sex trafficking ring run by Satanic, cannibalistic Democratic politicians and celebrities. Although it may sound absurd, it has nonetheless attracted devoted followers who have begun to perpetuate other theories that they suggest, imply or argue are somehow related to the main premise.</p>,

I want to exclude these a tags.

To get only the text from the paragraphs, you can use the .get_text() method. For example:

import requests
from bs4 import BeautifulSoup

base_url = "https://www.usatoday.com/story/tech/2020/09/17/qanon-conspiracy-theories-debunked-social-media/5791711002/"
response = requests.get(base_url)

soup = BeautifulSoup(response.content, "lxml")
body = soup.select("article p")

for paragraph in body:
    print(paragraph.get_text(strip=True, separator=' '))

Prints:

An emboldened community of believers known as QAnon is spreading a baseless patchwork of conspiracy theories that are fooling Americans who are looking for simple answers in a time of intense political polarization, social isolation and economic turmoil.


...etc.
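
To try the same .get_text() call without fetching the live page, here is a minimal offline sketch. The sample markup and the example.com link are made up for illustration (modelled on the paragraphs in the question), and "html.parser" is used instead of "lxml" only so that no extra parser needs to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the paragraphs in the question.
html = (
    '<article><p class="gnt_ar_b_p">Experts call QAnon '
    '<a class="gnt_ar_b_a" href="https://example.com/">a "digital cult"</a> '
    'because of its pseudo-religious qualities.</p></article>'
)

# "html.parser" ships with Python; "lxml" would also work if it is installed.
soup = BeautifulSoup(html, "html.parser")
paragraph = soup.select("article p")[0]

# Default call: concatenates every text node, including the link's text.
text_default = paragraph.get_text()

# Same call as in the answer: strip each text node, then join with a space.
text_joined = paragraph.get_text(strip=True, separator=" ")

print(text_joined)
# Experts call QAnon a "digital cult" because of its pseudo-religious qualities.
```

Both calls give the same string here. Note that .get_text() removes the a tag's markup but keeps the link's text. Also, with separator=" " a link followed directly by punctuation ends up with a space before that punctuation, because each text node is joined separately.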

Or you can .unwrap() all elements inside each paragraph and then get the text:

for paragraph in body:
    for tag in paragraph.find_all():
        tag.unwrap()
    print(paragraph.text)
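
One thing worth checking is whether the link text should stay in the output. The sketch below contrasts .unwrap(), which keeps the link's text, with .decompose(), which removes the tag together with its text. It assumes the same kind of made-up sample markup as the question's output (the example.com link is hypothetical), and .decompose() is an alternative I am suggesting here, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the paragraphs in the question.
html = (
    '<p class="gnt_ar_b_p">Experts call QAnon '
    '<a class="gnt_ar_b_a" href="https://example.com/">a "digital cult"</a> '
    'because of its pseudo-religious qualities.</p>'
)

# unwrap(): drop the <a> tag itself but leave its text in the paragraph.
soup_kept = BeautifulSoup(html, "html.parser")
for link in soup_kept.p.find_all("a"):
    link.unwrap()
kept_text = soup_kept.p.get_text()

# decompose(): drop the <a> tag together with everything inside it.
soup_dropped = BeautifulSoup(html, "html.parser")
for link in soup_dropped.p.find_all("a"):
    link.decompose()
# Removing the link leaves two adjacent spaces, so collapse whitespace.
dropped_text = " ".join(soup_dropped.p.get_text().split())

print(kept_text)
# Experts call QAnon a "digital cult" because of its pseudo-religious qualities.
print(dropped_text)
# Experts call QAnon because of its pseudo-religious qualities.
```

If the sentences should still read naturally, .unwrap() or .get_text() is probably the better fit. .decompose() suits cases where the linked words themselves are unwanted.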

