1.a. From the links present in the source code of a webpage, I want to build a list of all links like "mypage.php?REF=1137988", i.e. mypage.php?REF= followed by a number.
1.b. However, the source page also contains links like SuppForm.php?REF=1137988, which I want to avoid.
</TD></TR>
</TABLE>
<FONT CLASS=t><TABLE cellspacing=5><TR><TD bgcolor='#FFFFA0' style='border:5px ridge lightgray'><TABLE cellspacing=4><TR><TD VALIGN=top><FONT CLASS=t2><CENTER>2015-09-03<BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFFF' style='border:4px ridge lightgray'><CENTER><FONT CLASS=t9>1137988 <A HREF='SuppForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/supp.gif' width=12 height=12 border=0 TITLE='delete'></A> <A HREF='ModifForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/modif.gif' width=10 height=11 border=0 TITLE='modify'></A><BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFA0' style='border:4px ridge lightgray'><TABLE><TR><TD><IMG SRC='faces/F.gif' width=36 border=0></TD><TD><CENTER><FONT SIZE=1>Age<BR></FONT><FONT SIZE=5><B>35</TD></TR></TABLE></TD></TR></TABLE></TD></TR></TABLE></TD><TD WIDTH=50%><CENTER><FONT class=t><A HREF='mypage.php?REF=1137988' TARGET='_blank'><I>
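For what it's worth, the distinction in 1.a/1.b can be expressed as a single regular expression: anchoring the pattern on the literal text mypage.php?REF= means SuppForm.php links never match. A minimal sketch, using a shortened stand-in for the page source above:

```python
import re

# Shortened stand-in for the page source shown above.
html = ("<A HREF='SuppForm.php?REF=1137988'>x</A> "
        "<A HREF='mypage.php?REF=1137988'>y</A>")

# The literal "mypage.php" in the pattern excludes SuppForm.php links;
# the (\d+) group captures only the number after REF=.
refs = re.findall(r"mypage\.php\?REF=(\d+)", html)
# refs -> ['1137988']  (the SuppForm.php link is not matched)
```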
Here is my code so far, which I have been trying to get working:
from bs4 import BeautifulSoup
import urllib2

url = "http://wwww.somewebsite.com"
headers = {'User-Agent': 'Mozilla/5.0'}
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")
for link in links:
    # link.get("href") returns the href attribute, e.g. "mypage.php?REF=1137988"
    print "%s %s" % (link.get("href"), link.text)
print links
This means that the numbers I extract in step 1 have to be collected, separated by commas, into the replace = [ ] list below:
template = """fjajflakjfakjfl;kj REF={} sklkasalsjklas klajsl;kdajs;djas aksljl;askjflka """
replace = [1131062, 1140921, 1141326, 1141355, 1141426, 1141430, 1141461, 1141473, 1141477, 1141502]
output = [template.format(r) for r in replace]
with open('output.txt', 'w') as f_output:
    f_output.write(''.join([template.format(r) for r in replace]))
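For reference, str.format fills each {} placeholder in turn, so the list comprehension produces one filled-in copy of the template per REF number. A small sketch (the template text here is a made-up stand-in):

```python
# Made-up template; {} marks where each REF number is substituted.
template = "record REF={} end\n"
replace = [1131062, 1140921]

output = [template.format(r) for r in replace]
# output[0] -> "record REF=1131062 end\n"

# Joining the list gives the single string that gets written to output.txt.
joined = ''.join(output)
```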
So please help with the two things I am trying to do here. Sorry if the formatting is a bit off.
Thank you very much.
As suggested by @wilbur, I modified my code; this is what I did:
from bs4 import BeautifulSoup
import urllib2
import re
url = "somewebsite"
headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
links = soup.findAll('a', href=re.compile('.*mypage\.php\?REF=[0-9]*'))
template = """lasljasfkljaslkfj{}
slajfljasflk
aslkjfklasjflkasjf
alksjflkasjf;lk
"""
replace = [ link['href'].split("=")[1] for link in links ]
output = [template.format(r) for r in replace]
print output
with open('output.txt', 'w') as f_output:
f_output.write(''.join([template.format(r) for r in replace]))
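One caveat: link['href'].split("=")[1] only works while REF is the sole query parameter. A more robust sketch uses the standard library's URL parsing instead (the helper name ref_from_href is mine; the try/except import keeps it working on both Python 2 and 3):

```python
try:
    from urllib.parse import urlparse, parse_qs   # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs       # Python 2

def ref_from_href(href):
    """Return the REF value from an href like 'mypage.php?REF=1137988'."""
    query = urlparse(href).query        # e.g. "REF=1137988"
    return parse_qs(query)["REF"][0]    # "1137988"
```

With this in place, replace = [ref_from_href(link['href']) for link in links] survives hrefs such as mypage.php?REF=1137988&lang=en, where split("=")[1] would return "1137988&lang".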
The following will grab all the links that match your description, extract the REF parameter from each, and put the values into replace.
from bs4 import BeautifulSoup
import urllib2
import re

url = "http://wwww.somewebsite.com"
headers = {'User-Agent': 'Mozilla/5.0'}
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html, "html.parser")
links = soup.findAll('a', href=re.compile(r'.*mypage\.php\?REF=[0-9]*'))
replace = [link['href'].split("=")[1] for link in links]
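To see the filter in action without fetching a live page, the same two lines can be run against a shortened stand-in for the HTML fragment from the question; only the mypage.php link passes the href=re.compile filter:

```python
from bs4 import BeautifulSoup
import re

# Shortened stand-in for the page source from the question.
html = """<A HREF='SuppForm.php?REF=1137988'>delete</A>
<A HREF='mypage.php?REF=1137988'>profile</A>"""

soup = BeautifulSoup(html, "html.parser")
links = soup.findAll('a', href=re.compile(r'mypage\.php\?REF=[0-9]+'))
replace = [link['href'].split("=")[1] for link in links]
# replace -> ['1137988']  (only the mypage.php link matched)
```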