Parsing HTML pages using Beautiful Soup: family trees

I am writing a parsing script that should access the "img" tags in an HTML page, using Beautiful Soup. I use the findAll method to access each image, but I also want one more piece of information: the title of each image, which is located in the immediately preceding "a" tag, under the href attribute. The HTML code looks like:

<div class="thumbinner" style="width:202px;"><a href="/wiki/File:Edmund-Hillary.web.jpg" class="image"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/Edmund-Hillary.web.jpg/200px-Edmund-Hillary.web.jpg" width="200" height="272" class="thumbimage" srcset="//upload.wikimedia.........

I am trying to use the parent/child methods from Beautiful Soup but am getting errors. My code looks something like:

images = soup.findAll("img", width=True)  # access all image tags
jpegtitles= images.siblings['href']

I figured that since the "img" tag and the "a" tag were both children of the "div" tag, they would be accessible through the sibling method.

Any suggestions on how I could access href="/wiki/File:Edmund-Hillary.web.jpg"?

Because the img tag is nested inside the <a> tag, the <a> is its parent rather than a sibling, so you want the parent:

>>> soup.find('img', width=True).parent['href']
'/wiki/File:Edmund-Hillary.web.jpg'

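The other problem with your code is that findAll returns a list of tags, so you can't call .siblings on the result directly. Sibling navigation also works on a single tag only, through .next_siblings and .previous_siblings. If you have multiple images, use a loop.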
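
As a minimal sketch of that loop: the markup below is a shortened copy of the snippet from the question, and the second and third images (Example.jpg, no-link.jpg) are hypothetical additions to show several matches and an image with no enclosing link. It uses find_parent("a") instead of .parent so that an image which is not a direct child of its link is still handled; that is one reasonable choice, not the only one.

```python
from bs4 import BeautifulSoup

# Shortened sample markup; only the first link comes from the question,
# the other two images are made up for illustration.
html = """
<div class="thumbinner" style="width:202px;">
  <a href="/wiki/File:Edmund-Hillary.web.jpg" class="image"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/Edmund-Hillary.web.jpg/200px-Edmund-Hillary.web.jpg" width="200" height="272" class="thumbimage"></a>
  <a href="/wiki/File:Example.jpg" class="image"><img alt="" src="//example.org/Example.jpg" width="100"></a>
  <img alt="" src="//example.org/no-link.jpg" width="50">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

hrefs = []
# find_all returns a list-like result, so handle each <img> individually.
for img in soup.find_all("img", width=True):
    link = img.find_parent("a")  # nearest enclosing <a>, or None
    if link is not None and link.has_attr("href"):
        hrefs.append(link["href"])

print(hrefs)
# ['/wiki/File:Edmund-Hillary.web.jpg', '/wiki/File:Example.jpg']
```

The image without an enclosing link is skipped rather than raising an error, which is usually what you want when scraping pages with mixed markup.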
