
lstrip and rstrip won't work; need help removing text from an output in Python 3

The output is part of a list. When I try to figure out the output's type using type() it returns: <class 'bs4.element.Tag'>.

I am trying to remove everything to the left of "href" and everything to the right of "<img". I have tried lstrip and rstrip, but they do not work because each output in my list is unique. Even though each output in the list is unique, they all share the same format, with "href" and "<img".

Here is an example of one of the outputs in my list:

<a class="BlogList-item-image-link" href="/new-blog/nova-approval">
<img alt="Nova Approval" data-image="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg" data-image-dimensions="2432x2688" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg"/>
</a>

You are probably trying to extract the link in href. For that, you don't need to strip the string. You could do it in the following way:

string = '''<a class="BlogList-item-image-link" href="/new-blog/nova-approval">
<img alt="Nova Approval" data-image="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg" data-image-dimensions="2432x2688" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg"/>
</a>'''


print(string[string.find('href="') + 6:string.find('>') - 1])

Output:

/new-blog/nova-approval

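In the print() statement above, string.find('href="') returns the index at which that substring begins, and adding 6 (the length of href=") moves past it to the first character of the link. The slice then runs up to string.find('>') - 1, which is the position of the closing quote and is therefore excluded. This assumes the first > in the string comes immediately after the href value's closing quote.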
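If another attribute might follow href, a slightly more defensive variant is to search for the closing quote instead of the >. The sketch below is only an illustration: the helper name extract_href and the extra target attribute in the sample are assumptions, not part of the question.

```python
def extract_href(markup):
    """Return the value of the first href="..." attribute, or None if absent."""
    prefix = 'href="'
    start = markup.find(prefix)
    if start == -1:
        return None  # no href attribute in this string
    start += len(prefix)  # move past href=" to the first character of the link
    end = markup.find('"', start)  # closing quote of the href value
    if end == -1:
        return None  # malformed markup: the quote is never closed
    return markup[start:end]


# Hypothetical sample where href is NOT the last attribute before ">"
sample = '<a class="BlogList-item-image-link" href="/new-blog/nova-approval" target="_blank">'
print(extract_href(sample))  # /new-blog/nova-approval
```

With this sample, slicing up to the first > would also pick up the target attribute, which is why the closing quote is used as the end marker here.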

Hope this helps!

lstrip and rstrip are not the answer here: they remove a set of characters from the ends of a string, not everything up to a given substring.

Have you tried looking at the bs4 docs?

Because each item in your list is a bs4 Tag object, you can read the href directly as an attribute of that object.

from bs4 import BeautifulSoup

# The markup to parse; in practice this would be the full page source
html = '''<a class="BlogList-item-image-link" href="/new-blog/nova-approval">
<img alt="Nova Approval" data-image="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg" data-image-dimensions="2432x2688" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg"/>
</a>'''

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')  # all of the anchor tags, as a list of Tag objects

for link in links:
    print(link.get('href'))  # .get() returns None if the tag has no href

This will print all of the href values in the HTML file.
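Since the question already starts from a list of Tag objects, the same idea can be applied to that list directly. The following is a minimal sketch under stated assumptions: the shortened markup, the second anchor without an href, and the variable names tags and hrefs are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup: one anchor with an href, one without
html = '''<a class="BlogList-item-image-link" href="/new-blog/nova-approval"><img alt="Nova Approval"/></a>
<a class="BlogList-item-image-link"><img alt="No link"/></a>'''

soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')  # stands in for the asker's existing list of Tag objects

# has_attr() filters out tags without an href; tag['href'] alone would
# raise KeyError on those, whereas tag.get('href') would return None
hrefs = [tag['href'] for tag in tags if tag.has_attr('href')]
print(hrefs)  # ['/new-blog/nova-approval']
```

Whether to skip tags without an href or keep a None placeholder for them depends on what the list is used for afterwards.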
