BeautifulSoup: split text in a tag by <br/>
Is it possible to split the text of a tag at its br tags?
I have this tag's contents: [u'+420 777 593 531', <br/>, u'+420 776 593 531', <br/>, u'+420 775 593 531']
And I want to get only the numbers. Any advice?
EDIT:
[x for x in dt.find_next_sibling('dd').contents if x!=' <br/>']
This does not work at all.
You need to test for tags, which are modelled as Tag instances. Tag objects have a name attribute, while text nodes (which are NavigableString instances) don't:
[x for x in dt.find_next_sibling('dd').contents if getattr(x, 'name', None) != 'br']
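The same filter can also be written with an explicit type check instead of getattr. This is only a minimal sketch: the HTML string and the variable names are made up for illustration, and it assumes the <dd> holds nothing but text and <br/> elements.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical sample markup, modelled on the question.
html = "<dt>Phone</dt><dd>+420 777 593 531<br/>+420 776 593 531<br/>+420 775 593 531</dd>"
soup = BeautifulSoup(html, "html.parser")
dd = soup.dt.find_next_sibling("dd")

# Keep only text nodes; <br/> elements are Tag instances and are skipped.
numbers = [str(node).strip() for node in dd.contents if isinstance(node, NavigableString)]
# Drop any whitespace-only leftovers.
numbers = [n for n in numbers if n]
print(numbers)
```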
Since you appear to have only text and <br /> elements in that <dd> element, you may as well just get all the contained strings instead:
list(dt.find_next_sibling('dd').stripped_strings)
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <dt>Term</dt>
... <dd>
... +420 777 593 531<br/>
... +420 776 593 531<br/>
... +420 775 593 531<br/>
... </dd>
... ''')
>>> dt = soup.dt
>>> [x for x in dt.find_next_sibling('dd').contents if getattr(x, 'name', None) != 'br']
[u'\n +420 777 593 531', u'\n +420 776 593 531', u'\n +420 775 593 531', u'\n']
>>> list(dt.find_next_sibling('dd').stripped_strings)
[u'+420 777 593 531', u'+420 776 593 531', u'+420 775 593 531']
Using get_text(strip=True, separator='\n') with str.splitlines:
from bs4 import BeautifulSoup
soup = BeautifulSoup('''\
<dt>Term</dt>
<dd>
+420 777 593 531<br/>
+420 776 593 531<br/>
+420 775 593 531<br/>
</dd>
''', 'html.parser')
print(soup.dd.get_text(strip=True, separator='\n').splitlines())
# ['+420 777 593 531', '+420 776 593 531', '+420 775 593 531']
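One caveat with a separator: get_text joins every text node with it, so inline markup such as <b> inside a line would also produce a split. If that matters, the direct children can be grouped at each <br> instead. The helper below is a hypothetical sketch (split_on_br is not part of BeautifulSoup) and only handles <br> tags that are direct children of the given tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def split_on_br(tag):
    """Return the text of `tag` split into chunks at each direct <br> child."""
    chunks, current = [], []
    for node in tag.children:
        if isinstance(node, Tag) and node.name == "br":
            # A <br> closes the current chunk.
            chunks.append("".join(current).strip())
            current = []
        elif isinstance(node, Tag):
            # Other tags contribute their text to the current chunk.
            current.append(node.get_text())
        else:
            # Plain text node.
            current.append(str(node))
    chunks.append("".join(current).strip())
    # Discard empty chunks, e.g. after a trailing <br>.
    return [c for c in chunks if c]

# Hypothetical sample with inline markup inside the first line.
html = "<dd>+420 <b>777</b> 593 531<br/>+420 776 593 531<br/>+420 775 593 531<br/></dd>"
dd = BeautifulSoup(html, "html.parser").dd
lines = split_on_br(dd)
print(lines)
```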
tag = BeautifulSoup('''
<dd>
+420 777 593 531<br/>
+420 776 593 531<br/>
+420 775 593 531<br/>
</dd>
''', 'html.parser')
Convert this to a string:
str_tag = str(tag)
Now split on the <br/> tag, convert each piece back to BeautifulSoup, and extract the text from it:
numbers = [BeautifulSoup(part, 'html.parser').text.strip() for part in str_tag.split('<br/>')]
# output : ['+420 777 593 531', '+420 776 593 531', '+420 775 593 531', '']
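The string round trip re-parses every fragment and leaves a trailing empty entry. A possible alternative, sketched here under the assumption that modifying the parsed tree is acceptable, is to replace each <br/> with a newline in place and split the resulting text. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('''
<dd>
+420 777 593 531<br/>
+420 776 593 531<br/>
+420 775 593 531<br/>
</dd>
''', 'html.parser')
dd = soup.dd

# Note: this mutates the tree, swapping every <br/> for a newline text node.
for br in dd.find_all('br'):
    br.replace_with('\n')

# Split the text into lines and drop the blank ones.
numbers = [line.strip() for line in dd.get_text().splitlines() if line.strip()]
print(numbers)
```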