How to remove HTML tags from the scraped data using BeautifulSoup
I'm trying to scrape news articles, and I want all the paragraphs of each article. I used Soup.find_all('p') to get every paragraph, but the results still contain HTML tags. Because Soup.find_all('p') returns a bs4.element.ResultSet, I can't call methods such as .get_text(), .decompose(), or .strip() on it.
I also can't use Soup.find('p'), because it returns only the first paragraph and I need all of them.
Here is my code:
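    import requests
    from bs4 import BeautifulSoup

    # J is presumably the list of article URLs collected earlier (not shown)
    for story in J:
        page3 = requests.get(story)
        SOUP = BeautifulSoup(page3.content, 'html.parser')
        q = SOUP.find_all('p')
        print(q[0])  # prints the first <p> element, tags included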
Simply iterate over your ResultSet to get the stripped text of each paragraph, then join() the individual texts with a space:
' '.join([p.get_text(strip=True) for p in SOUP.find_all('p')])
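To try this without hitting the network, here is a minimal sketch on an inline HTML snippet. The html string is made up for illustration and is not taken from any real news page. It shows that the ResultSet behaves like a list of Tag objects, so get_text() has to be called on each element and not on the result set itself.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only
html = "<p>First paragraph.</p><p>Second paragraph.</p>"
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")  # a ResultSet, i.e. a list-like collection of Tags

# Calling get_text() on the ResultSet itself fails
try:
    paragraphs.get_text()
except AttributeError:
    print("ResultSet has no get_text()")

# Calling it on each Tag works
texts = [p.get_text(strip=True) for p in paragraphs]
joined = " ".join(texts)
print(joined)
```

The same pattern applies to any other per-element method: loop over the ResultSet and call the method on each Tag.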
Applied to the loop from the question:

    for story in J:
        page3 = requests.get(story)
        SOUP = BeautifulSoup(page3.content, 'html.parser')
        t = ' '.join([p.get_text(strip=True) for p in SOUP.find_all('p')])
        print(t)
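One caveat worth checking on real articles: with strip=True and the default empty separator, get_text() strips each text fragment inside a paragraph and then concatenates them, so words around inline tags such as <b> or <a> can end up glued together. The sketch below illustrates this and one possible workaround, passing a space as the separator. The helper name paragraph_texts and the sample HTML are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup

def paragraph_texts(html):
    """Return the text of every <p> in html, one string per paragraph.

    Hypothetical helper: uses a space separator so that text around
    inline tags is not glued together.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(" ", strip=True) for p in soup.find_all("p")]

# Made-up sample with an inline tag inside the first paragraph
sample = "<p>Some <b>bold</b> text.</p><p>Plain text.</p>"

# Default separator: fragments are stripped, then joined with ""
first_p = BeautifulSoup(sample, "html.parser").find("p")
glued = first_p.get_text(strip=True)
print(glued)

# Space separator keeps the words apart
texts = paragraph_texts(sample)
print("\n".join(texts))  # one paragraph per line
```

A space separator can leave a stray space before punctuation that directly follows an inline tag, so calling p.get_text().strip() is another option when the original spacing matters.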