
How to select tags that have a certain attribute value

Here's the thing.

I want to crawl only these tags from a page that is otherwise full of messy HTML:

<table bgcolor="FFFFFF" border="0" cellpadding="5" cellspacing="0" align="center">
    <tr>
        <td>
            <a href="./index.html?id=subjective&page=2">
                <img src='https://www.dogdrip.net/?module=file&act=procFileDownload&file_srl=224868098&sid=cc8c0afbb679bef6420500988a756054&module_srl=78' style='max-width:180px;max-height:270px' align='absmiddle' title="cutie cat">
            </a>
        </td>
    </tr>
</table>

My first attempt used a CSS selector. The selector was:

#div_article_contents > tr:nth-child(1) > th:nth-child(1) > table > tbody > tr:nth-child(1) > td > table > tbody > tr > td > a > img

but soup.select(selector) didn't work. It returned an empty list, and I don't know why.

Second, every tag I want to crawl has a specific style, so I tried:

soup.select('img[style = fixedstyle]')

but that didn't work either. It looks like a syntax error...

All I want to crawl is a list of the href links and a list of the img titles.

Please help me.

If the img tags have a specific style value, you can use what you tried; just remove the extra spaces and put the value in quotes:

from bs4 import BeautifulSoup

html='''
<a href='link'>
    <img src='address' style='max-width:222px;max-height:222px' title='owntitle'>
</a>
<a href='link'>
    <img src='address1' style='max-width:222px;max-height:222px' title='owntitle1'>
</a>
<a href='link'>
    <img src='address2' style='max-width:222px;max-height:222px' title='owntitle2'>
</a>
'''

srcs=[]
titles=[]
soup=BeautifulSoup(html,'html.parser')
# The attribute name goes outside the quotes; only the value is quoted.
for img in soup.select('img[style="max-width:222px;max-height:222px"]'):
    srcs.append(img['src'])
    titles.append(img['title'])
print(srcs)
print(titles)
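
The exact-match selector above only works when the style string is identical character for character. As a minimal sketch (the sample HTML below is made up for illustration), the other CSS attribute operators that select() accepts can loosen the match: `*=` matches a substring, and a bare `[attr]` only checks that the attribute is present.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: three images with different style values.
html = '''
<a href='link'><img src='address' style='max-width:222px;max-height:222px' title='owntitle'></a>
<a href='link'><img src='address1' style='max-width:180px;max-height:270px' title='owntitle1'></a>
<a href='link'><img src='address2' title='owntitle2'></a>
'''

soup = BeautifulSoup(html, 'html.parser')

# Exact match: the whole style value must be identical.
exact_titles = [img['title'] for img in
                soup.select('img[style="max-width:222px;max-height:222px"]')]

# Substring match: any style value containing "max-width".
partial_titles = [img['title'] for img in soup.select('img[style*="max-width"]')]

# Presence match: any img that has a title attribute at all.
all_titles = [img['title'] for img in soup.select('img[title]')]

print(exact_titles)
print(partial_titles)
print(all_titles)
```

The substring form is probably the more robust choice when the pixel sizes vary from image to image, as they appear to in the question.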

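Otherwise, you can start from the a tag and work down to the img like this:

# Start from fresh lists so results from the previous loop are not duplicated.
srcs=[]
titles=[]
for a in soup.select('a'):
    img=a.select_one('img')
    if img is None:
        continue  # skip links that do not wrap an image
    srcs.append(img['src'])
    titles.append(img['title'])
print(srcs)
print(titles)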
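
Since the question asks for the href links and the img titles, here is a minimal sketch against a trimmed copy of the question's table (the src value `cat.jpg` is a placeholder, not the real URL). It also illustrates one likely reason the long copied selector returned an empty list: browsers insert tbody elements into tables when building the DOM, so a selector copied from developer tools contains `> tbody >` steps that do not exist in the raw HTML that html.parser sees. This is an assumption about the page in question, not something confirmed from its source.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup; 'cat.jpg' is a placeholder src.
html = '''
<table bgcolor="FFFFFF" border="0" cellpadding="5" cellspacing="0" align="center">
    <tr>
        <td>
            <a href="./index.html?id=subjective&page=2">
                <img src="cat.jpg" style="max-width:180px;max-height:270px" title="cutie cat">
            </a>
        </td>
    </tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# Browser-style selector: fails because the raw HTML has no <tbody>.
copied = soup.select('table > tbody > tr > td > a > img')

# Shorter selector that does not depend on the exact nesting.
imgs = soup.select('table a > img[title]')

# Walk up from each img to its enclosing <a> to read the link.
hrefs = [img.find_parent('a')['href'] for img in imgs]
titles = [img['title'] for img in imgs]

print(copied)
print(hrefs)
print(titles)
```

Keeping the selector short and anchored on the attributes you care about tends to survive layout changes better than a full path copied from the browser.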

