
Find and list specific links in a webpage using Python

1.a. From the links present in the source code of a webpage, I want to make a list of all links like "mypage.php?REF=1137988", i.e. mypage.php?REF= followed by a number.

1.b. However, this source page also contains links like SuppForm.php?REF=1137988, which I wish to avoid. Here is an excerpt of the page source:

</TD></TR>
</TABLE>
<FONT CLASS=t><TABLE cellspacing=5><TR><TD bgcolor='#FFFFA0' style='border:5px ridge lightgray'><TABLE cellspacing=4><TR><TD VALIGN=top><FONT CLASS=t2><CENTER>2015-09-03<BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFFF' style='border:4px ridge lightgray'><CENTER><FONT CLASS=t9>1137988 <A HREF='SuppForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/supp.gif' width=12 height=12 border=0 TITLE='delete'></A> <A HREF='ModifForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/modif.gif' width=10 height=11 border=0 TITLE='modify'></A><BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFA0' style='border:4px ridge lightgray'><TABLE><TR><TD><IMG SRC='faces/F.gif' width=36 border=0></TD><TD><CENTER><FONT SIZE=1>Age<BR></FONT><FONT SIZE=5><B>35</TD></TR></TABLE></TD></TR></TABLE></TD></TR></TABLE></TD><TD WIDTH=50%><CENTER><FONT class=t><A HREF='mypage.php?REF=1137988' TARGET='_blank'><I>
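For instance, a regular expression along these lines (a minimal sketch, assuming the hrefs look exactly like those in the excerpt above) matches the mypage.php links but not SuppForm.php or ModifForm.php, and captures the trailing number:

import re

# sketch: capture only the number in hrefs like "mypage.php?REF=1137988";
# "SuppForm.php?REF=..." and "ModifForm.php?REF=..." will not match
ref_pattern = re.compile(r"mypage\.php\?REF=(\d+)")

sample = "HREF='SuppForm.php?REF=1137988' ... HREF='mypage.php?REF=1137988'"
print ref_pattern.findall(sample)   # ['1137988'] - only from the mypage.php link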

Here is my code so far, which I have been trying to get working:

from bs4 import BeautifulSoup
import urllib2
url = "http://wwww.somewebsite.com"

headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
links = soup.find_all("a")
for link in links:
    # the URL is in the tag's href attribute, not in an attribute called "a"
    print "%s %s" % (link.get("href"), link.text)

print links
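The URL lives in each tag's href attribute rather than in an attribute called "a", so one way to keep only the wanted links (a sketch reusing the links list from the code above, assuming relative hrefs like in the excerpt) is a plain startswith check:

# sketch: keep only hrefs of the form "mypage.php?REF=<number>"
# and collect the number that follows "REF="
numbers = []
for link in links:
    href = link.get("href", "")
    if href.startswith("mypage.php?REF="):
        numbers.append(href.split("REF=")[1])

print numbers   # e.g. ['1137988', ...]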
  1. I also want to put just the number after REF into a list, which will then go into the numbers part of this code (one way to extract them is sketched after the snippet below):
  2. This means the numbers I extract for the first list all have to be separated by commas so that they can go inside the replace = [ ]:

     template = """fjajflakjfakjfl;kj REF={} sklkasalsjklas klajsl;kdajs;djas aksljl;askjflka """ replace = [1131062, 1140921, 1141326, 1141355, 1141426, 1141430, 1141461, 1141473, 1141477, 1141502] output = [template.format(r) for r in replace] with open('output.txt', 'w') as f_output: f_output.write(''.join([template.format(r) for r in replace])) 

So please help with the two things that I wish to do here; sorry if the formatting is a bit off.

Thank you very much.

As suggested by @wilbur, I modified my code; this is what I did:

from bs4 import BeautifulSoup
import urllib2
import re

url = "somewebsite"

headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)

links = soup.findAll('a', href=re.compile(r'.*mypage\.php\?REF=[0-9]*'))
template = """lasljasfkljaslkfj{}
slajfljasflk
aslkjfklasjflkasjf
alksjflkasjf;lk
"""

replace = [ link['href'].split("=")[1] for link in links ]

output = [template.format(r) for r in replace]

print output
with open('output.txt', 'w') as f_output:
    f_output.write(''.join([template.format(r) for r in replace]))

The following will grab all the links that match your description, then get the REF parameter from each and put them into replace.

from bs4 import BeautifulSoup
import urllib2
import re
url = "http://wwww.somewebsite.com"

headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
links = soup.findAll('a', href=re.compile(r'.*mypage\.php\?REF=[0-9]*'))

replace = [ link['href'].split("=")[1] for link in links ]
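Putting the two pieces together, a complete sketch (the URL and the template text are placeholders) could look like this:

from bs4 import BeautifulSoup
import urllib2
import re

url = "http://www.somewebsite.com"            # placeholder URL
headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)

# only anchors whose href points at mypage.php, not SuppForm.php or ModifForm.php
links = soup.findAll('a', href=re.compile(r'mypage\.php\?REF=[0-9]+'))

# the number after "REF=" in every matching href
replace = [link['href'].split("=")[1] for link in links]

template = "... REF={} ...\n"                 # placeholder template text
with open('output.txt', 'w') as f_output:
    f_output.write(''.join(template.format(r) for r in replace))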
