
How do I extract one particular link from a page when each entry has two links, one that I want and one that I don't?

<td>
    <input type="hidden" name="ctl00$ContentPlaceHolder1$dlstCollege$ctl01$hdnInstituteId" id="ContentPlaceHolder1_dlstCollege_hdnInstituteId_1" value="866 " />
    <a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=866 ">ANA INSTITUTE OF PHARMACEUTICAL SCIENCES & RESEARCH,BAREILLY (866)</a>
    <br />
    <b>Location:</b> <span id="ContentPlaceHolder1_dlstCollege_lblAddress_1">13.5 km Bareilly - Delhi road, near rubber factory agras road,Bareilly</span>
    <br />
    <b>Course:</b> <span id="ContentPlaceHolder1_dlstCollege_lblCourse_1">B.Pharm,</span>
    <br />
    <b>Category:</b> <span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_1">Private</span>
    <br />
    <b>Web Address:</b> <a id="lnkBtnWebURL" href='' target="_blank"></a>
    <br />
</td>
</tr>
<tr>
<td>
    <input type="hidden" name="ctl00$ContentPlaceHolder1$dlstCollege$ctl02$hdnInstituteId" id="ContentPlaceHolder1_dlstCollege_hdnInstituteId_2" value="486 " />
    <a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_2" href="CollegeDetailedInformation.aspx?Inst=486 ">ANACOLLEGE OF ENGINEERING & MANAGEMENT,BAREILLY (486)</a>
    <br />
    <b>Location:</b> <span id="ContentPlaceHolder1_dlstCollege_lblAddress_2">13.5 Km. NH-24, Bareilly-Delhi Highway, Near Rubber Factory, Bareilly</span>
    <br />
    <b>Course:</b> <span id="ContentPlaceHolder1_dlstCollege_lblCourse_2">B.Tech,M.Tech,</span>
    <br />
    <b>Category:</b> <span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_2">Private</span>
    <br />
    <b>Web Address:</b> <a id="lnkBtnWebURL" href='http://www.anacollege.org/index.html' target="_blank">http://www.anacollege.org/index.html</a>
    <br />
</td>
</tr>

I want to extract one particular URL from each entry on this page (e.g. CollegeDetailedInformation.aspx?Inst=866), but each entry contains two anchor tags, and I don't want the second one (e.g. http://www.anacollege.org/index.html).


import requests
from bs4 import BeautifulSoup

res = requests.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
soup = BeautifulSoup(res.content, 'html.parser')

table = soup.find("table", attrs = {'class':'table table-bordered table-responsive'})

pagelink = []
# Collect every anchor in the table except the first one
for anchor in table.findAll('a')[1:]:
    link = anchor['href']
    print(link)
    url = 'https://erp.aktu.ac.in/WebPages/KYC/'+ link
    pagelink.append(url)
print(pagelink)

I wrote this code, but it extracts all the links:

CollegeDetailedInformation.aspx?Inst=486  
http://www.anacollege.org/index.html
CollegeDetailedInformation.aspx?Inst=602  
http://www.aashlarbschool.com
CollegeDetailedInformation.aspx?Inst=032  
http://www.abes.ac.in
CollegeDetailedInformation.aspx?Inst=290  
http://www.abesit.in
CollegeDetailedInformation.aspx?Inst=913  
http://www.abesitpharmacy.in
CollegeDetailedInformation.aspx?Inst=643  
http://www.vitsald.com
CollegeDetailedInformation.aspx?Inst=1036 
http://www.abss.edu.in

How do I solve this? I only want the links that contain the CollegeDetailedInformation.aspx?Inst= part.

The anchor elements that link to the college details pages have an id attribute that starts with ContentPlaceHolder1_dlstCollege_ . So pass that prefix as a regex to the attrs argument of find_all() (findAll() is the older alias for the same method):

import re

for anchor in table.findAll('a', attrs={"id": re.compile("^ContentPlaceHolder1_dlstCollege_.*")}):
    ...

You can also just pass that as an id keyword argument to find_all() :

for anchor in table.findAll('a', id=re.compile("^ContentPlaceHolder1_dlstCollege_.*")):
    ...

The regex can be made even more specific, like "^ContentPlaceHolder1_dlstCollege_hlpkInstituteName_.*" which should only match the link provided with the college's name.

(I would remove the [1:] slice you put at the end, since the id filter probably already excludes the link at the start that you didn't want. If it doesn't, add the slice back in.)
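As a rough, self-contained illustration of the id filter, here is a sketch that runs against a trimmed-down copy of the markup from the question instead of the live site. The sample HTML and the names name_id and detail_hrefs are made up for the example, and the .strip() call is an addition to drop the trailing space in the href.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down sample modeled on the markup in the question (not the live page).
html = """
<table>
  <tr><td>
    <a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_2" href="CollegeDetailedInformation.aspx?Inst=486 ">ANA COLLEGE (486)</a>
    <a id="lnkBtnWebURL" href="http://www.anacollege.org/index.html" target="_blank">http://www.anacollege.org/index.html</a>
  </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Match only anchors whose id starts with the college-name prefix.
name_id = re.compile(r"^ContentPlaceHolder1_dlstCollege_hlpkInstituteName_")

# The external website anchor (id="lnkBtnWebURL") does not match and is skipped.
detail_hrefs = [a["href"].strip() for a in soup.find_all("a", id=name_id)]
print(detail_hrefs)
```

Testing the filter on a small inline snippet like this makes it easier to confirm the regex before pointing it at the real page.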

You can use a CSS selector to find all the links you want: a[href*=CollegeDetailedInformation] matches every anchor whose href contains CollegeDetailedInformation.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
soup = BeautifulSoup(res.content, 'html.parser')


table = soup.find("table", attrs = {'class':'table table-bordered table-responsive'})

allAnchor = table.select("a[href*=CollegeDetailedInformation]")

pagelink = []
for anchor  in allAnchor:
    link = anchor['href']
    # print(link)
    url = 'https://erp.aktu.ac.in/WebPages/KYC/'+ link
    pagelink.append(url)

print(pagelink)

Output will be:

['https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=968  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=866  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=486  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=602  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=032  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=290  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=913  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=643  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=1036 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=312  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=986  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=686  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=805  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=225  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=799  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=041  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=952  ',

and so on....
]
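The URLs in this output still carry the trailing spaces from the page's href attributes. One possible cleanup, sketched here with only the standard library, is to strip each href and resolve it with urllib.parse.urljoin instead of concatenating strings. The helper name to_absolute and the constant BASE_URL are hypothetical; the base URL itself is the one used in the code above.

```python
from urllib.parse import urljoin

# Base path taken from the string concatenation in the code above.
BASE_URL = "https://erp.aktu.ac.in/WebPages/KYC/"

def to_absolute(href, base=BASE_URL):
    """Strip stray whitespace from an href and resolve it against the base URL."""
    return urljoin(base, href.strip())

# A relative href with trailing spaces, as it appears in the page markup.
print(to_absolute("CollegeDetailedInformation.aspx?Inst=968  "))
```

urljoin also leaves absolute URLs untouched, so the same helper would be safe to call on a link that already starts with http://.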

I don't know Python, but the general approach would be to populate an array inside the for loop, then check each entry for a substring that matches your filter and keep only the entries that contain it.

Initialize an empty array outside the loop (if Python allows an empty array), populate it in the loop, then do something like in_array (in PHP) for your filter: CollegeDetailedInformation.aspx?Inst=?.

This should be a good start until the Python experts come on board to help.
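In Python, that idea could look roughly like the sketch below. The hrefs list is a hand-copied sample of the output from the question, and the names wanted and detail_links are just illustrative; Python's in operator does the substring check.

```python
# Sample hrefs copied from the output shown in the question.
hrefs = [
    "CollegeDetailedInformation.aspx?Inst=486  ",
    "http://www.anacollege.org/index.html",
    "CollegeDetailedInformation.aspx?Inst=602  ",
    "http://www.aashlarbschool.com",
]

# Substring that every wanted link contains.
wanted = "CollegeDetailedInformation.aspx?Inst="

detail_links = []  # an empty list is fine in Python
for href in hrefs:
    if wanted in href:  # substring test
        detail_links.append(href.strip())  # drop the trailing spaces

print(detail_links)
```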

Try the following code snippet. Install the lxml library with pip before running it.

import requests as rq
from bs4 import BeautifulSoup as bs

res = rq.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
soup = bs(res.content, 'lxml')

table = soup.find("table", attrs = {'class':'table table-bordered table-responsive'})


links = [elem.strip() for anchor in table.findAll('a') for _,elem in anchor.attrs.items() if "=" in elem]

print(links)
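One caveat with the comprehension above is that it tests every attribute value for "=", so an external website URL that happens to contain a query string would slip through. A narrower variant, sketched here on made-up markup, looks only at href and checks its prefix. The sample HTML (including the example.com URL) is hypothetical, and the sketch uses the built-in html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor's external URL also contains "=".
html = """
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=866 ">College (866)</a>
<a id="lnkBtnWebURL" href="http://example.com/index.php?page=1" target="_blank">site</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only anchors that actually have an href attribute.
links = [
    a["href"].strip()
    for a in soup.find_all("a", href=True)
    if a["href"].startswith("CollegeDetailedInformation.aspx")
]
print(links)
```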

You can use the CSS selector a[id*="dlstCollege"] to select only the links you want.

For example:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
soup = BeautifulSoup(res.content, 'html.parser')


table = soup.find("table", attrs = {'class':'table table-bordered table-responsive'})

pagelink = []
for anchor in table.select('a[id*="dlstCollege"]')[1:]:
        link = anchor['href']
        print(link)
        url = 'https://erp.aktu.ac.in/WebPages/KYC/'+ link
        pagelink.append(url)

Prints:

CollegeDetailedInformation.aspx?Inst=866  
CollegeDetailedInformation.aspx?Inst=486  
CollegeDetailedInformation.aspx?Inst=602  
CollegeDetailedInformation.aspx?Inst=032  
CollegeDetailedInformation.aspx?Inst=290  
CollegeDetailedInformation.aspx?Inst=913  
CollegeDetailedInformation.aspx?Inst=643  

...and so on.

