BeautifulSoup: Extract the text that is not in a given tag

I have the following variable, header, equal to:

<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>

I want to extract only the date, February 11, 2017, from this variable. How can I do it using BeautifulSoup in Python?

If you know that the date is always the last text node in the header variable, then you could access the .contents property and take the last element of the returned list:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

header.contents[-1].strip()
> February 11, 2017
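
The snippets in this answer assume the markup is already stored in a variable named html. As a minimal, self-contained sketch (the html string below is simply the markup from the question), this prints the direct children of the p tag to show why the last entry of .contents is the date:

```python
from bs4 import BeautifulSoup

# The markup from the question, stored in `html` as the snippets here assume.
html = (
    "<p>Andrew Anglin<br/>\n"
    "<strong>Daily Stormer</strong><br/>\n"
    "February 11, 2017</p>"
)

soup = BeautifulSoup(html, "html.parser")
header = soup.find("p")

# Each direct child is either a tag (name is set) or a text node (name is None).
for child in header.contents:
    kind = child.name or "text"
    print(f"{kind:>6}: {child!r}")

# The final child is the text node holding the date, with a leading newline.
last_text = header.contents[-1].strip()
print(last_text)
```

This only works while the date stays the final child of the tag; a trailing tag or comment after the date would change what contents[-1] returns.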

Or, as MYGz pointed out in the comments, you could split the text on newlines and retrieve the last element of the list:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

header.text.split('\n')[-1]
> February 11, 2017
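
Splitting on '\n' depends on how the markup happens to be laid out. A possible variation, offered as a sketch rather than as part of the original answer, is BeautifulSoup's stripped_strings generator, which yields each text node with surrounding whitespace removed and skips empty ones:

```python
from bs4 import BeautifulSoup

# The markup from the question.
html = (
    "<p>Andrew Anglin<br/>\n"
    "<strong>Daily Stormer</strong><br/>\n"
    "February 11, 2017</p>"
)

header = BeautifulSoup(html, "html.parser").find("p")

# stripped_strings walks every text node under the tag (including text
# inside nested tags such as <strong>), strips whitespace, and skips
# strings that are empty after stripping.
strings = list(header.stripped_strings)
last_string = strings[-1]

print(strings)
print(last_string)
```

Like the split approach, this still assumes the date is the last piece of text in the tag.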

If you don't know the position of the date text node, then another option is to search the text for strings that look like a date:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

re.findall(r'\w+ \d{1,2}, \d{4}', header.text)[0]
> February 11, 2017
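
Indexing the findall result with [0] raises IndexError when nothing matches, and the pattern accepts any word in the month position. The following sketch wraps the same pattern in a helper that returns None instead and validates the match by parsing it. The name extract_date is hypothetical, and the "%B %d, %Y" format assumes English month names under the default locale:

```python
import re
from datetime import date, datetime
from typing import Optional

from bs4 import BeautifulSoup

# Same pattern as in the answer: a word, a 1-2 digit day, a 4-digit year.
DATE_PATTERN = re.compile(r"\w+ \d{1,2}, \d{4}")


def extract_date(text: str) -> Optional[date]:
    """Return the first candidate that parses as a date, or None.

    Hypothetical helper for illustration only.
    """
    for match in DATE_PATTERN.finditer(text):
        try:
            # %B expects a full month name, so "Stormer 11, 2017" is rejected.
            return datetime.strptime(match.group(0), "%B %d, %Y").date()
        except ValueError:
            continue
    return None


html = (
    "<p>Andrew Anglin<br/>\n"
    "<strong>Daily Stormer</strong><br/>\n"
    "February 11, 2017</p>"
)
header = BeautifulSoup(html, "html.parser").find("p")

parsed = extract_date(header.text)
print(parsed)
```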

However, as your title implies, if you only want to retrieve text nodes that aren't wrapped in an element tag, then you could use the following, which filters out element nodes:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

text_nodes = [e.strip() for e in header if not e.name and e.strip()]

Keep in mind that this returns the following, since the first text node isn't wrapped in a tag either:

> ['Andrew Anglin', 'February 11, 2017']
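
The check "not e.name" works because text nodes have no tag name. A more explicit way to express the same filter, shown here as a sketch under the same markup, is to test each direct child against bs4's NavigableString class:

```python
from bs4 import BeautifulSoup, NavigableString

# The markup from the question.
html = (
    "<p>Andrew Anglin<br/>\n"
    "<strong>Daily Stormer</strong><br/>\n"
    "February 11, 2017</p>"
)

header = BeautifulSoup(html, "html.parser").find("p")

# Keep only direct children that are text nodes, dropping tags such as
# <br/> and <strong>, and skipping whitespace-only strings.
text_nodes = [
    child.strip()
    for child in header.children
    if isinstance(child, NavigableString) and child.strip()
]

print(text_nodes)
```

Note that HTML comments are also NavigableString subclasses in bs4, so markup containing comments would need an extra check to exclude them.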

Of course, you could also combine the last two options and parse out the date strings from the returned text nodes:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

for node in header:
    if not node.name and node.strip():
        match = re.findall(r'^\w+ \d{1,2}, \d{4}$', node.strip())
        if match:
            print(match[0])

> February 11, 2017
