
Find and list specific links in a webpage using Python

1.a From the links present in the source code of a webpage, I want to make a list of all links like "mypage.php?REF=1137988", i.e. "mypage.php?REF=" followed by a number.

1.b However, this source page also contains links like "SuppForm.php?REF=1137988", which I wish to avoid.

</TD></TR>
</TABLE>
<FONT CLASS=t><TABLE cellspacing=5><TR><TD bgcolor='#FFFFA0' style='border:5px ridge lightgray'><TABLE cellspacing=4><TR><TD VALIGN=top><FONT CLASS=t2><CENTER>2015-09-03<BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFFF' style='border:4px ridge lightgray'><CENTER><FONT CLASS=t9>1137988 <A HREF='SuppForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/supp.gif' width=12 height=12 border=0 TITLE='delete'></A> <A HREF='ModifForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/modif.gif' width=10 height=11 border=0 TITLE='modify'></A><BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFA0' style='border:4px ridge lightgray'><TABLE><TR><TD><IMG SRC='faces/F.gif' width=36 border=0></TD><TD><CENTER><FONT SIZE=1>Age<BR></FONT><FONT SIZE=5><B>35</TD></TR></TABLE></TD></TR></TABLE></TD></TR></TABLE></TD><TD WIDTH=50%><CENTER><FONT class=t><A HREF='mypage.php?REF=1137988' TARGET='_blank'><I>
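For illustration, a minimal Python 3 sketch (regex-only, no network access; the HTML string is a shortened stand-in for the row above) showing how anchoring the pattern on mypage.php keeps the wanted link and skips the SuppForm.php/ModifForm.php ones:

```python
import re

# Shortened stand-in for the row above: three links, only one of them wanted.
sample_html = (
    "<A HREF='SuppForm.php?REF=1137988' target='_blank'></A> "
    "<A HREF='ModifForm.php?REF=1137988' target='_blank'></A> "
    "<A HREF='mypage.php?REF=1137988' TARGET='_blank'></A>"
)

# Anchoring on "mypage.php" skips the other two links; the capture
# group keeps only the digits after REF=.
refs = re.findall(r"mypage\.php\?REF=(\d+)", sample_html)
print(refs)  # ['1137988']
```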

Here is my code so far, which I have been trying to implement:

from bs4 import BeautifulSoup
import urllib2

url = "http://www.somewebsite.com"

headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")
for link in links:
    # print each anchor's href and its link text
    print "%s %s" % (link.get("href"), link.text)

print links
  1. I also want to put just the number after REF into a list, which I will use in the numbers part of this code.
  2. That means the numbers I extract into the first list will all have to be separated by commas so I can put them inside replace = [ ].

     template = """fjajflakjfakjfl;kj REF={} sklkasalsjklas klajsl;kdajs;djas aksljl;askjflka """
     replace = [1131062, 1140921, 1141326, 1141355, 1141426, 1141430, 1141461, 1141473, 1141477, 1141502]
     output = [template.format(r) for r in replace]
     with open('output.txt', 'w') as f_output:
         f_output.write(''.join([template.format(r) for r in replace]))
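The two steps above, collecting the REF numbers into a list and substituting each one into the template, can be sketched in Python 3 (the template text and the numbers here are stand-ins):

```python
# Stand-in REF numbers; in practice these come from the scraped links.
replace = [1131062, 1140921, 1141326]

# Any text containing {} works with str.format.
template = "some text REF={} more text\n"

# Fill the template once per REF number, then write them all out.
output = [template.format(r) for r in replace]
with open("output.txt", "w") as f_output:
    f_output.write("".join(output))
```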

So please help with the two things I wish to do here. Sorry if the formatting is a bit off.

Thank you very much.

As suggested by @wilbur, I modified my code; this is what I did:

from bs4 import BeautifulSoup
import urllib2
import re

url = "somewebsite"

headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all('a', href=re.compile(r'mypage\.php\?REF=[0-9]+'))
template = """lasljasfkljaslkfj{}
slajfljasflk
aslkjfklasjflkasjf
alksjflkasjf;lk
"""

# each link is a Tag, so take its href attribute before splitting
replace = [ link['href'].split("=")[1] for link in links ]

output = [template.format(r) for r in replace]

print output
with open('output.txt', 'w') as f_output:
    f_output.write(''.join(output))

The following will grab all the links that match your description, then get the REF parameter from each one and put those values into replace.

from bs4 import BeautifulSoup
import urllib2
import re

url = "http://www.somewebsite.com"

headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a', href=re.compile(r'mypage\.php\?REF=[0-9]+'))

# link['href'] is the attribute string, e.g. 'mypage.php?REF=1137988'
replace = [ link['href'].split("=")[1] for link in links ]
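Splitting on "=" works for these hrefs because each carries a single query parameter, but if a link ever has more than one (e.g. ?REF=...&x=1), urllib.parse is safer. A Python 3 sketch on hypothetical hrefs:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical hrefs of the shape found on the page.
hrefs = ["mypage.php?REF=1137988", "mypage.php?REF=1141326&x=1"]

replace = []
for href in hrefs:
    # parse_qs maps each query parameter name to a list of its values.
    qs = parse_qs(urlparse(href).query)
    replace.append(int(qs["REF"][0]))

print(replace)  # [1137988, 1141326]
```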
