BeautifulSoup: split the text in a tag by <br/>

Question

Is it possible to split the text of a tag by its br tags?

I have this tag contents: [u'+420 777 593 531', <br/>, u'+420 776 593 531', <br/>, u'+420 775 593 531']

And I want to get only the numbers. Any advice?

EDIT:

[x for x in dt.find_next_sibling('dd').contents if x!=' <br/>']

Does not work at all.

Answer 1

You need to test for tags, which are modelled as Tag instances. Tag objects carry the tag name in their name attribute, while text nodes (which are NavigableString instances) have no tag name:

[x for x in dt.find_next_sibling('dd').contents if getattr(x, 'name', None) != 'br']

Since you appear to only have text and <br/> elements in that <dd> element, you may as well just get all the contained strings instead:

list(dt.find_next_sibling('dd').stripped_strings)

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <dt>Term</dt>
... <dd>
...     +420 777 593 531<br/>
...     +420 776 593 531<br/>
...     +420 775 593 531<br/>
... </dd>
... ''')
>>> dt = soup.dt
>>> [x for x in dt.find_next_sibling('dd').contents if getattr(x, 'name', None) != 'br']
[u'\n    +420 777 593 531', u'\n    +420 776 593 531', u'\n    +420 775 593 531', u'\n']
>>> list(dt.find_next_sibling('dd').stripped_strings)
[u'+420 777 593 531', u'+420 776 593 531', u'+420 775 593 531']
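If the <dd> element might also contain inline markup (for example a <b> around part of a number), the tag test above can be turned into a small helper that groups everything between two <br/> tags into one chunk. This is only a sketch: the name split_on_br and the sample markup with the <b> tag are made up for illustration, and HTML comments are not treated specially.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def split_on_br(tag):
    """Return the text of `tag` as a list of chunks separated by <br> tags."""
    chunks, current = [], []
    for child in tag.children:
        if isinstance(child, Tag) and child.name == "br":
            # A <br> closes the chunk collected so far.
            chunks.append("".join(current).strip())
            current = []
        elif isinstance(child, Tag):
            # Any other tag contributes its text to the current chunk.
            current.append(child.get_text())
        else:
            # Plain text node (NavigableString).
            current.append(str(child))
    chunks.append("".join(current).strip())
    # Drop empty chunks, e.g. the whitespace after the last <br>.
    return [chunk for chunk in chunks if chunk]


html = """
<dt>Term</dt>
<dd>
    +420 777 593 531<br/>
    <b>+420</b> 776 593 531<br/>
    +420 775 593 531<br/>
</dd>
"""
soup = BeautifulSoup(html, "html.parser")
numbers = split_on_br(soup.dt.find_next_sibling("dd"))
print(numbers)
# ['+420 777 593 531', '+420 776 593 531', '+420 775 593 531']
```

For markup that only has text and <br/> children, stripped_strings gives the same result with less code.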

Answer 2

Using get_text(strip=True, separator='\n') with str.splitlines:

from bs4 import BeautifulSoup

soup = BeautifulSoup('''\
<dt>Term</dt>
<dd>
    +420 777 593 531<br/>
    +420 776 593 531<br/>
    +420 775 593 531<br/>
</dd>
''', 'html.parser')
print(soup.dd.get_text(strip=True, separator='\n').splitlines())
# ['+420 777 593 531', '+420 776 593 531', '+420 775 593 531']
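One caveat worth checking before relying on this: get_text() puts the separator between every text node, not only where a <br/> sits. The sample markup below is an assumption (the question's HTML has no inline tags), but it shows what happens when one is present:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first number is partly wrapped in <b>.
html = "<dd><b>+420</b> 777 593 531<br/>+420 776 593 531</dd>"
soup = BeautifulSoup(html, "html.parser")

parts = soup.dd.get_text(strip=True, separator="\n").splitlines()
print(parts)
# ['+420', '777 593 531', '+420 776 593 531']
```

The first number is split in two because the <b> boundary also counts as a gap between strings. With only text and <br/> children, as in the question, the approach works as shown.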

Answer 3

from bs4 import BeautifulSoup

tag = BeautifulSoup('''
<dd>
    +420 777 593 531<br/>
    +420 776 593 531<br/>
    +420 775 593 531<br/>
</dd>
''', 'html.parser')

Convert this to a string:

str_tag = str(tag)

Now split on the <br/> tag, convert each piece back to BeautifulSoup, and extract the text from it:

numbers = [BeautifulSoup(_, 'html.parser').text.strip() for _ in str_tag.split('<br/>')]
# output : ['+420 777 593 531', '+420 776 593 531', '+420 775 593 531', '']
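A variation that avoids the round trip through a string, sketched here as an alternative and not as part of the answer above: replace each <br/> in the parsed tree with a newline, then split the resulting text. Note that replace_with() modifies the tree in place.

```python
from bs4 import BeautifulSoup

html = """
<dd>
    +420 777 593 531<br/>
    +420 776 593 531<br/>
    +420 775 593 531<br/>
</dd>
"""
soup = BeautifulSoup(html, "html.parser")
dd = soup.dd

# Swap every <br> for a newline; find_all returns a list, so mutating
# the tree inside the loop is safe.
for br in dd.find_all("br"):
    br.replace_with("\n")

# Split on the newlines, trim each line, and drop the empty ones.
lines = [line.strip() for line in dd.get_text().splitlines()]
numbers = [line for line in lines if line]
print(numbers)
# ['+420 777 593 531', '+420 776 593 531', '+420 775 593 531']
```

Because the empty lines are filtered out, there is no trailing '' entry as in the output above.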

3 answers

Answer 1: 9, 2015-06-07 14:20:37
Answer 2: 0, 2021-10-29 11:18:54
Answer 3: 0, 2022-01-08 16:03:16
