How to exclude tags from scraping that have a specific tag as a child

I am trying to get all the paragraphs of an article using BeautifulSoup. I want to exclude the paragraph tags that contain another tag, such as an <a> tag, instead of paragraph text. If a paragraph has an <a> tag as a child but also contains text, I want to get only the text of the paragraph.

This is part of the HTML:

<div class="entry-content clearfix">
  <div class="entry-thumbnail>
  <p> In as name to here them deny wise this. As rapid woody my he me which. </p>
  <p> <a href="https://blabla"/> </p> 
  <p> Performed suspicion in certainty so frankness by attention pretended.
      Newspaper or in tolerably education enjoyment. </p>
  <p> <a href="https://blabla"/> When be draw drew ye. Defective in do recommend
      suffering. House it seven in spoil tiled court. Sister others marked 
      fat missed did out use.</p>
</div>

This is what I have done so far:

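# soup, dic and i are defined earlier in the script.
contents = []
content = soup.find('div', {"class": "entry-content clearfix"}).find_all("p")
for p in content:
    # Keep only paragraphs with no <a> descendant. Note that this also
    # drops paragraphs that mix a link with text.
    if not p.find("a"):
        contents.append(p)
if contents:
    dic['content'] = contents
else:
    print("ARTICLE:", i, "HAS NO content")
    dic['body'] = "No content"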

Use the get_text() method. It extracts the text from the paragraphs. Reference: https://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python

from bs4 import BeautifulSoup
contents = """<div class="entry-content clearfix">
  <div class="entry-thumbnail>
  <p> In as name to here them deny wise this. As rapid woody my he me which. </p>
  <p> <a href="https://blabla"/> </p> 
  <p> Performed suspicion in certainty so frankness by attention pretended.
      Newspaper or in tolerably education enjoyment. </p>
  <p> <a href="https://blabla"/> When be draw drew ye. Defective in do recommend
      suffering. House it seven in spoil tiled court. Sister others marked 
      fat missed did out use.</p>
</div>"""
soup = BeautifulSoup(contents, "lxml")
print(soup.get_text()) 

Result:

Performed suspicion in certainty so frankness by attention pretended.
      Newspaper or in tolerably education enjoyment. 
  When be draw drew ye. Defective in do recommend
      suffering. House it seven in spoil tiled court. Sister others marked 
      fat missed did out use.
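
For the per-paragraph filtering described in the question, get_text() can also be called on each <p> separately, so that paragraphs holding nothing but a link come back empty and can be skipped. The following is only a minimal sketch under some assumptions: the function name paragraph_texts, the include_link_text flag, and the simplified sample markup (with the thumbnail <div> properly closed and a link label added) are illustrative and not part of the original post. It uses the built-in html.parser backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Simplified sample modeled on the question's HTML (an assumption): the
# thumbnail <div> is closed and the second link has a visible label.
HTML = """<div class="entry-content clearfix">
  <div class="entry-thumbnail"></div>
  <p> In as name to here them deny wise this. </p>
  <p> <a href="https://blabla"></a> </p>
  <p> Performed suspicion in certainty. </p>
  <p> <a href="https://blabla">link</a> When be draw drew ye. </p>
</div>"""


def paragraph_texts(html, include_link_text=True):
    """Return the text of each non-empty <p> inside the article container.

    Hypothetical helper: with include_link_text=False, only strings that
    are direct children of the <p> are kept, so text inside <a> is ignored.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="entry-content")
    if container is None:
        return []

    texts = []
    for p in container.find_all("p"):
        if include_link_text:
            # All text below <p>, whitespace-stripped and joined by spaces.
            text = p.get_text(" ", strip=True)
        else:
            # Only the strings sitting directly under <p>, not in child tags.
            pieces = p.find_all(string=True, recursive=False)
            text = " ".join(s.strip() for s in pieces if s.strip())
        if text:  # a paragraph containing only an empty <a> yields ""
            texts.append(text)
    return texts


for line in paragraph_texts(HTML, include_link_text=False):
    print(line)
```

Whether link text should count as part of the paragraph is a judgment call, which is why the sketch exposes it as a flag rather than picking one behavior.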
