Why can siblings of a tag in BeautifulSoup4 be strings?
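At first glance, it seemed natural to me that .next_sibling and .previous_sibling would return sibling tags. But when I played with them today, I got a NavigableString such as "\n".
After checking the documentation carefully, I found that it states: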
In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
You might think that the .next_sibling of the first <a> tag would be the second <a> tag. But actually, it’s a string: the comma and newline that separate the first <a> tag from the second:
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
link.next_sibling
# u',\n'
The second <a> tag is actually the .next_sibling of the comma:
link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Why is this?
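Beautiful Soup's parse tree contains strings as well as tags, so .next_sibling returns whichever node comes next. The .next_sibling attribute is meant for fine-grained navigation of an HTML document, which is something CSS selectors cannot do: they can select tags, but not the strings between tags. For example, you cannot use a CSS selector to select the string SELECT THIS in: <p>some text</p>SELECT THIS<p>some text</p>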
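As a minimal sketch of that point, the snippet below parses the exact fragment from the sentence above and reaches the bare string through .next_sibling. The variable names (fragment, first_p, between, paragraphs) are only illustrative, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# The fragment from the answer: a bare string sits between two <p> tags.
fragment = "<p>some text</p>SELECT THIS<p>some text</p>"
soup = BeautifulSoup(fragment, "html.parser")

first_p = soup.p

# The node right after the first <p> is the string, not the second <p>.
between = first_p.next_sibling
print(type(between))   # <class 'bs4.element.NavigableString'>
print(str(between))    # SELECT THIS

# A CSS selector matches elements only, so it finds the two <p> tags
# but has no way to address the string between them.
paragraphs = soup.select("p")
print(len(paragraphs)) # 2
```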
If you want to search for sibling tags, use the find_next_sibling() method. You can also mimic the .next_sibling behaviour by passing the text=True argument to find_next_sibling():
data = '''
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
link = soup.a
print(link) # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(type(link.next_sibling)) # <class 'bs4.element.NavigableString'>
print(link.find_next_sibling()) # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
print(type(link.find_next_sibling(text=True))) # <class 'bs4.element.NavigableString'>
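If you need every following sibling tag rather than just the first, one possible approach (a sketch, not the only way) is to iterate over .next_siblings and keep only Tag instances, which should give the same result as find_next_siblings(). The names tag_siblings, sibling_ids and first_string are illustrative. The last line uses string=True, which the Beautiful Soup documentation describes as the newer name for the text argument (since 4.4.0); this assumes a reasonably recent bs4 install.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

data = '''
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'''

soup = BeautifulSoup(data, "html.parser")
link = soup.a

# .next_siblings yields every following node, strings included,
# so filter on Tag to drop the whitespace strings.
tag_siblings = [node for node in link.next_siblings if isinstance(node, Tag)]
sibling_ids = [tag["id"] for tag in tag_siblings]
print(sibling_ids)  # ['link2', 'link3']

# find_next_siblings() with no arguments returns tags only.
print(tag_siblings == link.find_next_siblings())  # True

# string=True asks for the next string sibling instead of the next tag.
first_string = link.find_next_sibling(string=True)
print(repr(str(first_string)))  # '\n'
```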