
Getting paragraphs between two tags in BeautifulSoup4 [closed]

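I'm new to programming, Python, and BS4, and I'm hoping to get better through a web crawler project. I have a bunch of similar pages containing information I'd like to separate out. This is the template I have to work with: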

<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p>Third part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>

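The "Directed By" and "Written By" information is easy enough to collect, but I'd also like to capture the synopsis and cast paragraphs. The problem is that the synopsis isn't always three paragraphs long on the site (sometimes fewer, sometimes more), so I can't hard-code it. My idea was to use the word "Synopsis" in the text to define a start and an end point and collect everything in between, but I'm not sure how to implement that. I tried regex, but I don't understand it well and don't know how to use HTML tags in a regex.

Any help would be appreciated.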

from bs4 import BeautifulSoup

text = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p>Third part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>"""

soup = BeautifulSoup(text, "html.parser")

synopsis = ''
for para in soup.find_all("p"):
    if para.get('class') == ['writerDirector']:
        break
    synopsis += para.text + '\n'

print(synopsis)

Output:

First part of synopsis
Second part of paragraph
Third part of paragraph

Getting the cast takes a bit of hard-coding:

cast_text = text[text.index('<h4>Cast</h4>'):]

soup = BeautifulSoup(cast_text, "html.parser")

cast_members = ''
for para in soup.find_all('p'):
    cast_members += para.text + '\n'

print(cast_members)

Output:

List of the cast in one line
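
As a possible alternative to slicing the raw HTML string, the cast paragraph can also be reached by navigating from the heading itself. The following is a minimal sketch, assuming the "Cast" heading is an h4 whose only text is "Cast" and that the cast paragraph is its next p sibling, as in the template; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the <h4> whose only text is "Cast" (assumed to match the template).
cast_heading = soup.find("h4", string="Cast")

# Take the first <p> that follows the heading at the same nesting level.
cast_para = cast_heading.find_next_sibling("p") if cast_heading else None
cast_members = cast_para.get_text(strip=True) if cast_para else ""

print(cast_members)
```

This avoids re-parsing a substring, and it falls back to an empty string if a page has no "Cast" heading.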

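This probably captures the gist of the technique you need.

You know that the content you want starts with the H3 element. From there you walk through its next_siblings. Siblings such as blank lines ('\n') have a sibling.name of None, so we can safely ignore them. This code displays sibling.name and the full sibling for each sibling of the H3 element. You have already indicated that you know how to dig into these.

Now all you have to do is write code that takes note when it sees the h4 element for "Cast", so that it can arrange to read one more p element for the cast.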

>>> HTML = '''\
... <h3>Synopsis</h3>
... <p>First part of synopsis</p>
... <p>Second part of paragraph</p>
... <p>Third part of paragraph</p>
... <p class="writerDirector"><strong>Written By:</strong> Writer<br>
... <strong>Directed By:</strong> Director</p>
... <h4>Cast</h4>
... <p>List of the cast in one line</p>
... '''
>>> import bs4
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> h3 = soup.find('h3')
>>> for sibling in h3.next_siblings:
...     if sibling.name:
...         sibling.name
...         sibling
...         
'p'
<p>First part of synopsis</p>
'p'
<p>Second part of paragraph</p>
'p'
<p>Third part of paragraph</p>
'p'
<p class="writerDirector"><strong>Written By:</strong> Writer<br/>
<strong>Directed By:</strong> Director</p>
'h4'
<h4>Cast</h4>
'p'
<p>List of the cast in one line</p>
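
The flagging step described above could be sketched roughly as follows. This is only one way to do it, not the answerer's own code: the function name extract_synopsis_and_cast and the two flag variables are hypothetical, and the sketch assumes the synopsis ends at the p with class writerDirector. It uses isinstance(sibling, Tag) in place of the sibling.name check to skip bare strings, and html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup, Tag

HTML = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p>Third part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>
"""


def extract_synopsis_and_cast(h3):
    """Walk the siblings after `h3`; return (synopsis paragraphs, cast text)."""
    synopsis = []
    cast = None
    in_synopsis = True    # True until the writerDirector paragraph is reached
    expect_cast = False   # set once the <h4>Cast</h4> heading has been seen

    for sibling in h3.next_siblings:
        if not isinstance(sibling, Tag):
            continue      # skip bare strings such as "\n"
        if sibling.name == "h3":
            break         # a new section starts here
        if sibling.name == "h4" and sibling.get_text(strip=True) == "Cast":
            expect_cast = True
            in_synopsis = False
        elif sibling.name == "p":
            if expect_cast:
                cast = sibling.get_text(strip=True)
                break     # only one cast paragraph is expected
            if "writerDirector" in sibling.get("class", []):
                in_synopsis = False
            elif in_synopsis:
                synopsis.append(sibling.get_text(strip=True))

    return synopsis, cast


soup = BeautifulSoup(HTML, "html.parser")
synopsis, cast = extract_synopsis_and_cast(soup.find("h3"))
print(synopsis)
print(cast)
```

Because the loop stops collecting at the writerDirector paragraph rather than after a fixed count, it does not matter how many synopsis paragraphs a page has.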

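Assuming you have multiple synopses on a page (and even if you don't), you can iterate over the soup and collect everything between the h3 Synopsis tags: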

from bs4 import BeautifulSoup

html ="""<html><h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p>Third part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>
<h3>Synopsis</h3>
<p>First part of synopsis 2</p>
<p>Second part of paragraph 2</p>
<p class="writerDirector"><strong>Written By:</strong> Writer 2<br>
<strong>Directed By:</strong> Director 2</p>
<h4>Cast</h4>
<p>List of the cast in one line 2</p></html>"""


soup = BeautifulSoup(html, 'lxml')
value = ""
start = False

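# Visit only the block-level tags; iterating over every tag would also
# visit nested ones such as <strong> and duplicate their text.
for i in soup.find_all(['h3', 'h4', 'p']):
    if i.name == 'h3' and i.string == 'Synopsis':
        if start:
            # A new section begins: flush what was collected so far.
            print(value)
            value = ""
        print("Synopsis")
        start = True
    elif start:
        value = value + " " + i.text
# Flush the last section.
if value:
    print(value)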
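
If the collected text needs to be kept apart per section rather than printed as one string, the same single pass can fill a list of dictionaries instead. The sketch below is an illustration under assumptions, not part of the original answer: the keys "synopsis", "credits", and "cast" and the variable names are made up for the example, and it assumes every section follows the template shown in the question.

```python
from bs4 import BeautifulSoup

html = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>
<h3>Synopsis</h3>
<p>First part of synopsis 2</p>
<p class="writerDirector"><strong>Written By:</strong> Writer 2<br>
<strong>Directed By:</strong> Director 2</p>
<h4>Cast</h4>
<p>List of the cast in one line 2</p>"""

soup = BeautifulSoup(html, "html.parser")

records = []     # one dict per <h3>Synopsis</h3> section
current = None   # the record being filled in, if any
field = None     # which key of `current` receives plain <p> text

for tag in soup.find_all(["h3", "h4", "p"]):
    text = tag.get_text(" ", strip=True)
    if tag.name == "h3":
        if text == "Synopsis":
            current = {"synopsis": [], "credits": None, "cast": []}
            records.append(current)
            field = "synopsis"
        else:
            current = None   # some other section: ignore its content
    elif current is None:
        continue             # nothing to fill before the first Synopsis
    elif tag.name == "h4":
        field = "cast" if text == "Cast" else None
    elif "writerDirector" in tag.get("class", []):
        current["credits"] = text
        field = None         # the synopsis ends at the credits paragraph
    elif field is not None:
        current[field].append(text)

for record in records:
    print(record)
```

Each record then holds a list of synopsis paragraphs of whatever length the page provides, the credits line, and the cast paragraph.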
