Find and list specific links in a webpage using Python
1.a. From the links present in the source code of a webpage, I want to build a list of all links like "mypage.php?REF=1137988", i.e. mypage.php?REF= followed by a number.
1.b. However, this source page also contains links like SuppForm.php?REF=1137988, which I wish to avoid.
Here is an excerpt of the page source:
</TD></TR>
</TABLE>
<FONT CLASS=t><TABLE cellspacing=5><TR><TD bgcolor='#FFFFA0' style='border:5px ridge lightgray'><TABLE cellspacing=4><TR><TD VALIGN=top><FONT CLASS=t2><CENTER>2015-09-03<BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFFF' style='border:4px ridge lightgray'><CENTER><FONT CLASS=t9>1137988 <A HREF='SuppForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/supp.gif' width=12 height=12 border=0 TITLE='delete'></A> <A HREF='ModifForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/modif.gif' width=10 height=11 border=0 TITLE='modify'></A><BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFA0' style='border:4px ridge lightgray'><TABLE><TR><TD><IMG SRC='faces/F.gif' width=36 border=0></TD><TD><CENTER><FONT SIZE=1>Age<BR></FONT><FONT SIZE=5><B>35</TD></TR></TABLE></TD></TR></TABLE></TD></TR></TABLE></TD><TD WIDTH=50%><CENTER><FONT class=t><A HREF='mypage.php?REF=1137988' TARGET='_blank'><I>
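Since the live page is not available here, the filtering idea can be sketched with the standard library alone. The HTML string below is a trimmed, hypothetical stand-in for the sample row above:

```python
import re

# Trimmed, hypothetical stand-in for the sample row above: it mixes
# the three kinds of links the page contains.
html = ("<A HREF='SuppForm.php?REF=1137988'>x</A> "
        "<A HREF='ModifForm.php?REF=1137988'>y</A> "
        "<A HREF='mypage.php?REF=1137988'>z</A>")

# Anchoring the pattern on the literal file name "mypage.php" means
# SuppForm.php?REF=... and ModifForm.php?REF=... can never match.
refs = re.findall(r"mypage\.php\?REF=(\d+)", html)
# refs is now ['1137988']
```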
Here is my code so far, which I have been trying to get working:
from bs4 import BeautifulSoup
import urllib2

url = "http://wwww.somewebsite.com"
headers = {'User-Agent': 'Mozilla/5.0'}
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
links = soup.find_all("a")
for link in links:
    print link.get("href"), link.text
print links
This means that the numbers I extract from the first list will have to be separated by commas so that they can be put inside replace = [ ]:
template = """fjajflakjfakjfl;kj REF={} sklkasalsjklas klajsl;kdajs;djas aksljl;askjflka """
replace = [1131062, 1140921, 1141326, 1141355, 1141426, 1141430, 1141461, 1141473, 1141477, 1141502]
output = [template.format(r) for r in replace]
with open('output.txt', 'w') as f_output:
    f_output.write(''.join([template.format(r) for r in replace]))
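To see what the template/replace step produces, here is a minimal sketch with a shortened, hypothetical template (the long filler text above is irrelevant to the mechanics):

```python
# Shortened, hypothetical template; only the {} placeholder matters.
template = "... REF={} ...\n"
replace = [1131062, 1140921]

# str.format substitutes each number into the placeholder in turn,
# and join concatenates the filled-in copies.
output = ''.join(template.format(r) for r in replace)
# output == "... REF=1131062 ...\n... REF=1140921 ...\n"
```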
So please help with the two things I wish to do here. Sorry if the formatting is a bit off.
Thank you very much.
As suggested by @wilbur, I modified my code. This is what I did:
from bs4 import BeautifulSoup
import urllib2
import re

url = "somewebsite"
headers = {'User-Agent': 'Mozilla/5.0'}
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
links = soup.findAll('a', href=re.compile('.*mypage\.php\?REF=[0-9]*'))
template = """lasljasfkljaslkfj{}
slajfljasflk
aslkjfklasjflkasjf
alksjflkasjf;lk
"""
replace = [link['href'].split("=")[1] for link in links]
output = [template.format(r) for r in replace]
print output
with open('output.txt', 'w') as f_output:
    f_output.write(''.join(output))
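As an aside, splitting on "=" works here only because each href contains exactly one "=". A sketch of a more robust alternative is to parse the query string properly (the module is urlparse in Python 2 and urllib.parse in Python 3):

```python
try:  # Python 2
    from urlparse import urlparse, parse_qs
except ImportError:  # Python 3
    from urllib.parse import urlparse, parse_qs

href = "mypage.php?REF=1137988"
# parse_qs returns a dict of lists, e.g. {'REF': ['1137988']},
# so it keeps working even if the URL gains more parameters.
ref = parse_qs(urlparse(href).query)["REF"][0]
# ref == "1137988"
```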
The following will grab all the links that match your description, extract the REF parameter from each, and put the values into replace.
from bs4 import BeautifulSoup
import urllib2
import re

url = "http://wwww.somewebsite.com"
headers = {'User-Agent': 'Mozilla/5.0'}
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
links = soup.findAll('a', href=re.compile('.*mypage\.php\?REF=[0-9]*'))
replace = [link['href'].split("=")[1] for link in links]
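The same pipeline can be condensed into a stdlib-only sketch that runs without the live page; the HTML string is hypothetical and mimics the question's sample, with re.findall standing in for BeautifulSoup's findAll:

```python
import re

# Hypothetical page fragment mixing wanted and unwanted links.
html = ("<A HREF='SuppForm.php?REF=1137988'>d</A>"
        "<A HREF='mypage.php?REF=1137988'>a</A>"
        "<A HREF='mypage.php?REF=1131062'>b</A>")

# Capture only the mypage.php hrefs, then take the part after "=",
# just as the findAll/split combination above does.
hrefs = re.findall(r"HREF='(mypage\.php\?REF=[0-9]+)'", html)
replace = [h.split("=")[1] for h in hrefs]
# replace == ['1137988', '1131062']
```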