<td> <input type="hidden" name="ctl00$ContentPlaceHolder1$dlstCollege$ctl01$hdnInstituteId" id="ContentPlaceHolder1_dlstCollege_hdnInstituteId_1" value="866 " /> <a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=866 ">ANA INSTITUTE OF PHARMACEUTICAL SCIENCES & RESEARCH,BAREILLY (866)</a> <br /> <b>Location:</b> <span id="ContentPlaceHolder1_dlstCollege_lblAddress_1">13.5 km Bareilly - Delhi road, near rubber factory agras road,Bareilly</span> <br /> <b>Course:</b> <span id="ContentPlaceHolder1_dlstCollege_lblCourse_1">B.Pharm,</span> <br /> <b>Category:</b> <span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_1">Private</span> <br /> <b>Web Address:</b> <a id="lnkBtnWebURL" href='' target="_blank"></a> <br /> </td> </tr> <tr> <td> <input type="hidden" name="ctl00$ContentPlaceHolder1$dlstCollege$ctl02$hdnInstituteId" id="ContentPlaceHolder1_dlstCollege_hdnInstituteId_2" value="486 " /> <a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_2" href="CollegeDetailedInformation.aspx?Inst=486 ">ANACOLLEGE OF ENGINEERING & MANAGEMENT,BAREILLY (486)</a> <br /> <b>Location:</b> <span id="ContentPlaceHolder1_dlstCollege_lblAddress_2">13.5 Km. NH-24, Bareilly-Delhi Highway, Near Rubber Factory, Bareilly</span> <br /> <b>Course:</b> <span id="ContentPlaceHolder1_dlstCollege_lblCourse_2">B.Tech,M.Tech,</span> <br /> <b>Category:</b> <span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_2">Private</span> <br /> <b>Web Address:</b> <a id="lnkBtnWebURL" href='http://www.anacollege.org/index.html' target="_blank">http://www.anacollege.org/index.html</a> <br /> </td> </tr>
I want to extract a particular URL (e.g. CollegeDetailedInformation.aspx?Inst=866) from this website, but each entry has two anchor tags, one of which I don't want (e.g. http://www.anacollege.org/index.html).
import requests
from bs4 import BeautifulSoup

res = requests.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
soup = BeautifulSoup(res.content, 'html.parser')
table = soup.find("table", attrs = {'class':'table table-bordered table-responsive'})
pagelink = []
for anchor in table.findAll('a')[1:]:
    link = anchor['href']
    print(link)
    url = 'https://erp.aktu.ac.in/WebPages/KYC/'+ link
    pagelink.append(url)
print(pagelink)
I wrote this code, but it extracts all the links:
CollegeDetailedInformation.aspx?Inst=486
http://www.anacollege.org/index.html
CollegeDetailedInformation.aspx?Inst=602
http://www.aashlarbschool.com
CollegeDetailedInformation.aspx?Inst=032
http://www.abes.ac.in
CollegeDetailedInformation.aspx?Inst=290
http://www.abesit.in
CollegeDetailedInformation.aspx?Inst=913
http://www.abesitpharmacy.in
CollegeDetailedInformation.aspx?Inst=643
http://www.vitsald.com
CollegeDetailedInformation.aspx?Inst=1036
http://www.abss.edu.in
How do I solve this? I only want the links with the CollegeDetailedInformation.aspx?Inst=? part.
The anchor elements which are links to view college details have an id attribute which starts with ContentPlaceHolder1_dlstCollege_. So pass that as a regex to the attrs argument of find_all():
import re
for anchor in table.findAll('a', attrs={"id": re.compile("^ContentPlaceHolder1_dlstCollege_.*")}):
    ...
You can also just pass that as an id keyword argument to find_all():
for anchor in table.findAll('a', id=re.compile("^ContentPlaceHolder1_dlstCollege_.*")):
    ...
The regex can be made even more specific, like "^ContentPlaceHolder1_dlstCollege_hlpkInstituteName_.*", which should only match the link provided with the college's name.
(I would remove the [1:] you put at the end, since it was probably there to skip a link at the start that you didn't want, and the id filter should already exclude it. If it doesn't, then add it back in.)
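To make the id-based filter concrete without hitting the live site, here is a minimal sketch that runs against a trimmed copy of the markup from the question. The SAMPLE_HTML string and the variable names are assumptions made for illustration (the college names are shortened), and the hrefs are stripped because the page emits them with a trailing space.

```python
import re
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of two rows from the page in the question.
SAMPLE_HTML = """
<table class="table table-bordered table-responsive">
<tr><td>
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=866 ">ANA INSTITUTE (866)</a>
<a id="lnkBtnWebURL" href="" target="_blank"></a>
</td></tr>
<tr><td>
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_2" href="CollegeDetailedInformation.aspx?Inst=486 ">ANA COLLEGE (486)</a>
<a id="lnkBtnWebURL" href="http://www.anacollege.org/index.html" target="_blank">http://www.anacollege.org/index.html</a>
</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Match only anchors whose id starts with the college-name prefix.
name_link_id = re.compile(r"^ContentPlaceHolder1_dlstCollege_hlpkInstituteName_")

# strip() drops the trailing space present in the page's href values.
detail_links = [a["href"].strip() for a in soup.find_all("a", id=name_link_id)]
print(detail_links)
```

The anchors with id lnkBtnWebURL never match the pattern, so the external web addresses are left out without any slicing.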
You can use a CSS selector to find exactly the links you want. For example, a[href*=CollegeDetailedInformation] matches every anchor whose href contains CollegeDetailedInformation.
import requests
from bs4 import BeautifulSoup

res = requests.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
soup = BeautifulSoup(res.content, 'html.parser')
table = soup.find("table", attrs = {'class':'table table-bordered table-responsive'})
allAnchor = table.select("a[href*=CollegeDetailedInformation]")
pagelink = []
for anchor in allAnchor:
    link = anchor['href']
    # print(link)
    url = 'https://erp.aktu.ac.in/WebPages/KYC/'+ link
    pagelink.append(url)
print(pagelink)
Output will be:
['https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=968 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=866 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=486 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=602 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=032 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=290 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=913 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=643 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=1036 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=312 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=986 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=686 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=805 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=225 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=799 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=041 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=952 ',
and so on....
]
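Note that each URL in the output above ends with a space, because the href values in the page do. A small sketch of one way to clean that up, assuming you are happy to resolve the links against the list page's own URL with the standard library's urljoin instead of concatenating strings (BASE_URL and to_absolute are names chosen here for illustration):

```python
from urllib.parse import urljoin

# URL of the list page (query string omitted); relative hrefs resolve against it.
BASE_URL = "https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx"

def to_absolute(href, base=BASE_URL):
    """Strip stray whitespace from an href and resolve it against the base URL."""
    return urljoin(base, href.strip())

print(to_absolute("CollegeDetailedInformation.aspx?Inst=866 "))
```

urljoin leaves an href that is already absolute unchanged, so the same helper is safe to call on any link from the page.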
I don't know Python, but the general approach would be to populate an array in the for loop and then search it for entries containing your filter sub-string, keeping only those that match.
Initialize an empty array outside the loop (if an empty array is allowed in Python), populate it in the loop, then do something like in_array (in PHP) for your filter: CollegeDetailedInformation.aspx?Inst=.
This should be a good start until the masters of Python come on board to help.
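In Python, the collect-then-filter idea described above might look like the following sketch. The hrefs list is sample data copied from the output in the question, and WANTED is a name chosen here for illustration; a list comprehension with the in operator plays the role of the sub-string lookup.

```python
# Sample hrefs as they come off the page (some carry a trailing space).
hrefs = [
    "CollegeDetailedInformation.aspx?Inst=486 ",
    "http://www.anacollege.org/index.html",
    "CollegeDetailedInformation.aspx?Inst=602 ",
    "http://www.aashlarbschool.com",
]

# Sub-string that identifies the links we want to keep.
WANTED = "CollegeDetailedInformation.aspx?Inst="

# Keep only matching hrefs, stripping the stray whitespace as we go.
detail_links = [h.strip() for h in hrefs if WANTED in h]
print(detail_links)
```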
Try the following code snippet. Install the lxml library with pip before proceeding.
import requests as rq
from bs4 import BeautifulSoup as bs

res = rq.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
soup = bs(res.content, 'lxml')
table = soup.find("table", attrs = {'class':'table table-bordered table-responsive'})
# Keep every attribute value that contains "=" (the detail-page hrefs carry a query string)
links = [elem.strip() for anchor in table.findAll('a') for _,elem in anchor.attrs.items() if "=" in elem]
print(links)
You can use the CSS selector a[id*="dlstCollege"] to filter only the links you want.
For example:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
soup = BeautifulSoup(res.content, 'html.parser')
table = soup.find("table", attrs = {'class':'table table-bordered table-responsive'})
pagelink = []
for anchor in table.select('a[id*="dlstCollege"]')[1:]:
    link = anchor['href']
    print(link)
    url = 'https://erp.aktu.ac.in/WebPages/KYC/'+ link
    pagelink.append(url)
Prints:
CollegeDetailedInformation.aspx?Inst=866
CollegeDetailedInformation.aspx?Inst=486
CollegeDetailedInformation.aspx?Inst=602
CollegeDetailedInformation.aspx?Inst=032
CollegeDetailedInformation.aspx?Inst=290
CollegeDetailedInformation.aspx?Inst=913
CollegeDetailedInformation.aspx?Inst=643
...and so on.