Searching a web page for specific href values using BS4

Question

I am working on a third-party application where I have read access to the web page's source. From it, I need to collect only the href values that match a pattern like /aems/file/filegetrevision.do?fileEntityId. Is this possible? My code returns all the href values.

HTML (part of the page)

<td width="50%">
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz">
screenshot.doc
</a>
</td>

CODE

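for a in soup.find_all('a', {"style": "display:inline; position:relative;"}, href=True):
    href = a['href'].strip()
    # The hrefs already start with "/", so the base URL has no trailing slash.
    href = "https://xyz.test.com" + href
    print(href)  # inside the loop, so every collected href is printed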

Thanks

Answer 1

Yes, just use a suitable filter for the href attribute, for example a function:

def is_file_revision(href):
    # href can be None for <a> tags that have no href attribute
    return href and '/aems/file/filegetrevision' in href

soup.find_all('a', href=is_file_revision)

Besides functions, you can also use compiled regular expression objects as filters:

import re

href_pattern = re.compile(some_regular_expression)
soup.find_all('a', href=href_pattern)

See the docs: Kinds of filters
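
As a rough end-to-end sketch of the regex approach applied to the HTML from the question: the pattern, the two extra <a> tags, the helper name matching_links, and the use of urljoin to build absolute URLs are assumptions added for illustration, not part of the original post.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample markup: the first link is from the question; the other two
# are hypothetical, added to show what gets filtered out.
HTML = """
<td width="50%">
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz">
screenshot.doc
</a>
<a href="/aems/other/page.do">other page</a>
<a name="no-href">anchor without an href</a>
</td>
"""

BASE_URL = "https://xyz.test.com/"  # base URL taken from the question's code

# Assumed pattern: hrefs that start with the path given in the question.
HREF_PATTERN = re.compile(r"^/aems/file/filegetrevision\.do\?fileEntityId=")


def matching_links(html, base_url=BASE_URL):
    """Return absolute URLs for <a> tags whose href matches HREF_PATTERN."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex passed as an attribute filter is matched against
    # the attribute value; tags without an href do not match.
    tags = soup.find_all("a", href=HREF_PATTERN)
    # urljoin avoids the doubled slash that plain concatenation produces.
    return [urljoin(base_url, tag["href"].strip()) for tag in tags]


links = matching_links(HTML)
for link in links:
    print(link)
```

Only the filegetrevision.do link should be printed here; the other two anchors are skipped by the filter.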

solution1
2 ACCEPTED 2013-01-08 16:01:55
