
BS4: Getting text in tag

I'm using beautiful soup. There is a tag like this:

<li><a href="example"> sro, <small>small</small></a></li>

I want to get only the text directly inside the anchor <a> tag, without anything from the <small> tag in the output, i.e. " sro, ".

I tried find('li').text[0] but it does not work.

Is there a command in BS4 which can do that?

One option would be to get the first element from the contents of the a element:

>>> from bs4 import BeautifulSoup
>>> data = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
>>> soup = BeautifulSoup(data, 'html.parser')
>>> print(soup.find('a').contents[0])
 s.r.o., 

Another one would be to find the small tag and get its previous sibling:

>>> print(soup.find('small').previous_sibling)
 s.r.o., 

Well, there are all sorts of alternative/crazy options also:

>>> print(next(soup.find('a').descendants))
 s.r.o., 
>>> print(next(iter(soup.find('a'))))
 s.r.o., 
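All of the options above rely on the wanted text being the first node inside the anchor. As a hedged sketch of a more general variant, the hypothetical helper own_text below (not part of BS4) collects only the text nodes that are direct children of a tag, using find_all with string=True and recursive=False, so text nested in child tags such as <small> is skipped wherever it appears:

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Join only the text nodes that are direct children of `tag`.

    `own_text` is an illustrative helper name, not a BS4 function.
    """
    # string=True matches text nodes only; recursive=False restricts the
    # search to direct children, so text inside <small> is ignored.
    return "".join(tag.find_all(string=True, recursive=False))


data = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
soup = BeautifulSoup(data, "html.parser")
print(repr(own_text(soup.find("a"))))
```

Unlike contents[0], this still works when the nested tag comes before the text or sits in the middle of it.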

Use .children (it is an iterator, so take its first item with next()):

>>> next(soup.find('a').children)
' s.r.o., '

If you want to loop over every anchor tag in an HTML string and print its leading content, this works (for a live web page, first fetch the HTML, e.g. with urlopen from urllib.request):

from bs4 import BeautifulSoup
data = '<li><a href="example">s.r.o., <small>small</small></a></li> <li><a href="example">2nd</a></li> <li><a href="example">3rd</a></li>'
soup = BeautifulSoup(data, 'html.parser')
a_tag = soup('a')              # soup('a') is shorthand for soup.find_all('a')
for tag in a_tag:
    print(tag.contents[0])     # .contents is the list of children; [0] is the leading text node

Output:

s.r.o.,  
2nd
3rd

a_tag is a list-like ResultSet containing all anchor tags; collecting them in one place makes it easy to process them as a group when more than one <a> tag is present.

>>> print(a_tag)
[<a href="example">s.r.o., <small>small</small></a>, <a href="example">2nd</a>, <a href="example">3rd</a>]
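The loop above can be wrapped into a small function, sketched here under a few assumptions: fetch_html and first_anchor_nodes are hypothetical helper names, and fetch_html is only defined, not called, since the answer does not name a URL. The function also skips empty anchors, where contents[0] would otherwise raise an IndexError:

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def first_anchor_nodes(html):
    """Return the first child of every <a> tag that has any content."""
    soup = BeautifulSoup(html, "html.parser")
    # Empty anchors have an empty .contents list, so they are filtered out.
    return [a.contents[0] for a in soup.find_all("a") if a.contents]


data = ('<li><a href="example">s.r.o., <small>small</small></a></li> '
        '<li><a href="example">2nd</a></li> <li><a></a></li>')
for node in first_anchor_nodes(data):
    print(node)
```

Note that the first child is not guaranteed to be text; for an anchor that starts with a nested tag, contents[0] is that tag.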

According to the documentation, the text of a tag can be retrieved through its .string property. That only works when the tag has a single child, so the <small> tag is removed first:

soup = BeautifulSoup('<li><a href="example"> s.r.o., <small>small</small></a></li>', 'html.parser')
res = soup.find('a')
res.small.decompose()
print(res.string)
# s.r.o., 

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring
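One caveat with this approach is that decompose() permanently removes the <small> tag from the parsed tree. A minimal sketch of a non-destructive variant, assuming it is acceptable to work on a copy made with copy.copy (the variable name clone is illustrative), could look like this:

```python
import copy

from bs4 import BeautifulSoup

html = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
soup = BeautifulSoup(html, "html.parser")
res = soup.find("a")

# <a> has two children (a text node and <small>), so .string is None here.
print(res.string)

# Work on a detached copy so the original tree keeps its <small> tag.
clone = copy.copy(res)
clone.small.decompose()
print(repr(clone.string))

# The original document is unchanged.
print(soup.find("small"))
```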
