How to remove HTML tags from scraped data using BeautifulSoup

Question

I'm trying to scrape news articles, and I want all the paragraphs of each article. I used Soup.find_all('p') to get every paragraph, but the results still contain HTML tags. Because Soup.find_all('p') returns a bs4.element.ResultSet, I can't call methods such as .get_text(), .decompose(), or .strip() on it directly.

I also can't use Soup.find('p'), because it returns only the first paragraph and I need all of them.

Here is my code:

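import requests
from bs4 import BeautifulSoup

# J is assumed to be an iterable of article URLs (its definition is not shown)
for story in J:
    page3 = requests.get(story)
    SOUP = BeautifulSoup(page3.content, 'html.parser')
    q = SOUP.find_all('p')
    print(q[0])  # prints the first <p> element, tags included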


Answer 1

Simply iterate over your ResultSet, call get_text(strip=True) on each element to get its stripped text, and join() the individual texts with a space:

' '.join([p.get_text(strip=True) for p in SOUP.find_all('p')])
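The one-liner works because the ResultSet behaves like a list of Tag objects: get_text() is defined on each element, not on the collection. The following is a minimal sketch of that point; the sample markup is made up for illustration and stands in for a downloaded page.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup standing in for a downloaded article page.
html = "<div><p>First paragraph.</p><p> Second paragraph. </p></div>"

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")

# find_all() returns a list-like ResultSet whose items are Tag objects.
is_list = isinstance(paragraphs, list)
all_tags = all(isinstance(p, Tag) for p in paragraphs)

# get_text() lives on each Tag, so call it once per element.
texts = [p.get_text(strip=True) for p in paragraphs]
print(is_list, all_tags, texts)
```

Indexing (paragraphs[0]) and len(paragraphs) work for the same reason.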

Example

for story in J:
    page3 = requests.get(story)
    SOUP = BeautifulSoup(page3.content, 'html.parser')
    t = ' '.join([p.get_text(strip=True) for p in SOUP.find_all('p')])
    print(t)
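To try the same idea without any network access, the sketch below wraps it in a helper and runs it on an inline HTML string. The function name article_text, its paragraph_sep parameter, and the sample markup are all hypothetical. It also passes separator=" " to get_text(), which is a small variation on the answer's one-liner: with strip=True alone, text fragments around inline tags such as <b> are joined without a space.

```python
from bs4 import BeautifulSoup


def article_text(html: str, paragraph_sep: str = " ") -> str:
    """Return the text of every <p> in html, joined by paragraph_sep.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # separator=" " keeps words apart when a paragraph contains inline tags.
    parts = [p.get_text(separator=" ", strip=True) for p in soup.find_all("p")]
    # Skip paragraphs that are empty once whitespace is stripped.
    return paragraph_sep.join(part for part in parts if part)


# Made-up sample markup: a heading, two real paragraphs, one blank paragraph.
sample = (
    "<article><h1>Title</h1>"
    "<p>First <b>bold</b> claim.</p>"
    "<p>  </p>"
    "<p>Second paragraph.</p></article>"
)

print(article_text(sample))
```

Passing paragraph_sep="\n" instead keeps one paragraph per line, which may be easier to post-process than a single space-joined string.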

Solution 1 (0) 2022-04-04 16:29:17
