How to exclude tags that have a specific tag as a child when scraping

Question

I am trying to get all the paragraphs of an article using BeautifulSoup, excluding the paragraph tags that contain another tag, such as an <a> tag, instead of text. If a paragraph has both an <a> tag as a child and text of its own, I only want the text of the paragraph.

This is part of the HTML:

<div class="entry-content clearfix">
  <div class="entry-thumbnail>
  <p> In as name to here them deny wise this. As rapid woody my he me which. </p>
  <p> <a href="https://blabla"/> </p> 
  <p> Performed suspicion in certainty so frankness by attention pretended.
      Newspaper or in tolerably education enjoyment. </p>
  <p> <a href="https://blabla"/> When be draw drew ye. Defective in do recommend
      suffering. House it seven in spoil tiled court. Sister others marked 
      fat missed did out use.</p>
</div>

This is what I have tried so far:

contents = []
content = soup.find('div', {"class": "entry-content clearfix"}).find_all("p")
for p in content:
    if not (p.find(findChildren("a"))):
        contents[p] = content
if (content):
    dic['content'] = content
else:
    print("ARTICLE:", i, "HAS NO content")
    dic['body'] = "No content"

Answer 1

Use the get_text() function. It extracts the text from the paragraphs. Reference: https://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python

from bs4 import BeautifulSoup
contents = """<div class="entry-content clearfix">
  <div class="entry-thumbnail>
  <p> In as name to here them deny wise this. As rapid woody my he me which. </p>
  <p> <a href="https://blabla"/> </p> 
  <p> Performed suspicion in certainty so frankness by attention pretended.
      Newspaper or in tolerably education enjoyment. </p>
  <p> <a href="https://blabla"/> When be draw drew ye. Defective in do recommend
      suffering. House it seven in spoil tiled court. Sister others marked 
      fat missed did out use.</p>
</div>"""
soup = BeautifulSoup(contents, "lxml")
print(soup.get_text())

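Result (the first paragraph is missing, apparently because the class attribute of the entry-thumbnail div has no closing quote, so the parser reads that paragraph as part of the attribute value):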

Performed suspicion in certainty so frankness by attention pretended.
      Newspaper or in tolerably education enjoyment. 
  When be draw drew ye. Defective in do recommend
      suffering. House it seven in spoil tiled court. Sister others marked 
      fat missed did out use.
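
To get what the question asks for, get_text() alone may not be enough, since it returns all text in the document at once. The following is a minimal sketch, not a definitive solution. It assumes that "only the text of the paragraph" means the text nodes directly inside each <p>, leaving out anything inside a child <a>. The helper name paragraph_texts and the sample markup are hypothetical: the markup is modeled on the question's HTML but simplified, with attribute quotes closed and explicit </a> tags. It uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup, NavigableString


def paragraph_texts(html):
    """Return the own text of each <p> in the entry-content div,
    skipping paragraphs that have no text outside their child tags."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="entry-content")
    if container is None:
        return []

    results = []
    for p in container.find_all("p"):
        # Keep only text nodes that are direct children of <p>;
        # text nested inside <a> (or any other child tag) is ignored.
        own_text = "".join(
            child for child in p.children if isinstance(child, NavigableString)
        )
        # Collapse runs of whitespace and newlines into single spaces.
        own_text = " ".join(own_text.split())
        if own_text:
            results.append(own_text)
    return results


# Hypothetical sample markup, simplified from the question's HTML.
sample = """<div class="entry-content clearfix">
  <p> In as name to here them deny wise this. </p>
  <p> <a href="https://blabla">link only</a> </p>
  <p> <a href="https://blabla"></a> When be draw drew ye. </p>
</div>"""

paragraphs = paragraph_texts(sample)
print(paragraphs)
```

With this sample, the paragraph that contains only a link is dropped, and the paragraph that mixes a link with text contributes only its own text. If the link text should be kept as well, calling p.get_text() on each paragraph in place of the direct-children filter would be the simpler choice.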
