The output is part of a list. When I try to figure out the output's type using type() it returns: <class 'bs4.element.Tag'>.
I am trying to remove everything to the left of "href" and everything to the right of "<img". I have tried lstrip and rstrip, but they do not work because each output in my list is unique. Even though each output in the list is unique, they all have the same format, with "href" and "<img".
Here is an example of one of the outputs in my list:
<a class="BlogList-item-image-link" href="/new-blog/nova-approval">
<img alt="Nova Approval" data-image="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg" data-image-dimensions="2432x2688" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg"/>
</a>
You are probably trying to extract the link in href. For that you don't need to strip the string. You could do it like this:
string = '''<a class="BlogList-item-image-link" href="/new-blog/nova-approval">
<img alt="Nova Approval" data-image="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg" data-image-dimensions="2432x2688" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg"/>
</a>'''
print( string[string.find('href="')+6:string.find('>')-1] )
Output:
/new-blog/nova-approval
In the print() statement above, string.find('href="') returns the index where href=" starts. Adding 6 skips past those six characters, and the slice then runs up to the character just before the first >, which drops the closing quote. This assumes that > follows immediately after the href attribute.
Hope this helps !
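If the > does not follow the href attribute directly (for example, when class comes after href), the slice above picks up extra text. A minimal sketch of a variant that stops at the closing quote instead is shown here; the helper name extract_href and the reordered sample tag are made up for illustration. Note that this works on plain strings: a bs4 Tag would first have to be converted with str(tag), because Tag.find searches for child tags rather than substrings.

```python
def extract_href(markup):
    """Return the value of the first href="..." in markup, or None if absent."""
    marker = 'href="'
    start = markup.find(marker)
    if start == -1:
        return None  # no href attribute found
    start += len(marker)  # move past href=" to the first character of the value
    end = markup.find('"', start)  # closing quote of the attribute value
    if end == -1:
        return None  # malformed markup: the value is never closed
    return markup[start:end]


# Hypothetical sample with the attributes reordered, so > no longer follows href
sample = '<a href="/new-blog/nova-approval" class="BlogList-item-image-link">'
print(extract_href(sample))  # /new-blog/nova-approval
```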
Using lstrip and rstrip won't be the answer.
Have you tried looking at the bs4 docs?
Because your output is a bs4 Tag object, you can read the href attribute directly from the object.
<a class="BlogList-item-image-link" href="/new-blog/nova-approval">
<img alt="Nova Approval" data-image="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg" data-image-dimensions="2432x2688" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg"/>
</a>
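from bs4 import BeautifulSoup

# html must hold the page markup itself as a string, not a URL
html = '<a class="BlogList-item-image-link" href="/new-blog/nova-approval"></a>'
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
links = soup.find_all('a')  # all of the anchor tags, as a list of Tag objects
for link in links:
    print(link.get('href'))  # None if the tag has no href attribute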
This will print all of the href values in the HTML file.
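Since the items in the question's list are already Tag objects, there should be no need to parse anything again. A short sketch, assuming the list is called outputs (a made-up name) and using a trimmed copy of the sample markup plus one extra anchor without an href:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the question's markup; the second anchor is hypothetical
html = '''<a class="BlogList-item-image-link" href="/new-blog/nova-approval">
<img alt="Nova Approval" data-load="false"/>
</a>
<a class="no-link">no href here</a>'''

soup = BeautifulSoup(html, 'html.parser')
outputs = soup.find_all('a')  # stand-in for the list of Tag objects in the question

# tag.get('href') returns None when the attribute is missing;
# has_attr lets us skip those tags entirely
hrefs = [tag.get('href') for tag in outputs if tag.has_attr('href')]
print(hrefs)  # ['/new-blog/nova-approval']

# Dictionary-style access also works, but raises KeyError if href is absent
print(outputs[0]['href'])  # /new-blog/nova-approval
```

Using get is the safer choice when some tags in the list may lack an href, while tag['href'] fails loudly, which can be useful when every tag is expected to have one.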