Python: Scan all sites of a website for specific URLs
I want to scan my forum for specific links. All links look like this: http://www.vbulletinxyz-forum.tld/forum/showthread.php?t=17590. Only the thread number at the end of the link changes.

Currently I am using the following code, but it only works for one specific URL, not for all threads of the forum. How would I have to change the code to make it scan all threads?
import urllib

mypath = "http://vbulletin-forumxyz.tld/forum/showthread.php?t=1"
mylines = urllib.urlopen(mypath).readlines()
for item in mylines:
    if "http://specific.tld" in item:
        print item[item.index("http://specific.tld"):]
(1) Looping over the thread numbers is easy to implement, but probably not all thread numbers (t) exist, so there will be a lot of 404 responses.
(2) Take a look at Scrapy.
Update (1): here is how it can be done in principle. Note that a) the URL you provided is not reachable (it is a dummy), so I did not test it, and b) this is Python 3.x.
import time
import urllib.error
import urllib.request

def mypath(t):
    return "http://vbulletin-forumxyz.tld/forum/showthread.php?t={}".format(t)

for t in range(2):
    try:
        conn = urllib.request.urlopen(mypath(t))
    except urllib.error.HTTPError:
        # urlopen raises for 404 and other error statuses; skip that thread
        continue
    # check status code
    if conn.getcode() != 200:
        continue
    mylines = conn.read().decode('utf-8').splitlines()
    for item in mylines:
        if "http://specific.tld" in item:
            print(item)
    # avoid fetching too fast (you might get banned otherwise)
    time.sleep(0.5)
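Printing the whole line also prints the HTML around the link. If only the link itself is wanted, a regular expression can pull it out of the page text. This is a minimal sketch, not part of the original answer: the helper name find_links and the sample HTML are made up for illustration, and the pattern simply assumes that a link ends at the first quote, whitespace, or angle bracket.

```python
import re

def find_links(html, prefix="http://specific.tld"):
    """Return every URL in html that starts with prefix (hypothetical helper)."""
    # Assumption: a URL ends at the first quote, whitespace, or angle bracket.
    pattern = re.compile(re.escape(prefix) + r"[^\s\"'<>]*")
    return pattern.findall(html)

# Made-up sample page fragment; only the first link matches the prefix.
sample = ('<a href="http://specific.tld/file/123">download</a> '
          '<a href="http://other.tld/x">other</a>')
for link in find_links(sample):
    print(link)
```

Inside the loop, this could be called on the decoded page text instead of testing each line separately.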
This is how it works; the following version checks thread numbers 0 through 399,999.
import time
import urllib.error
import urllib.request

def mypath(t):
    return "http://www.someforum.org/forum/showthread.php?t={}".format(t)

for t in range(0, 400000):
    try:
        conn = urllib.request.urlopen(mypath(t))
    except urllib.error.HTTPError:
        # urlopen raises for 404 and other error statuses; skip that thread
        continue
    # check status code
    if conn.getcode() != 200:
        continue
    mylines = conn.read().decode('windows-1251').splitlines()
    for item in mylines:
        if "http://someurl.tld" in item:
            print(item)
    # avoid fetching too fast (you might get banned otherwise)
    # time.sleep(0.5)
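The two scripts differ mainly in the hard-coded encoding (utf-8 versus windows-1251). As a possible alternative, the charset can often be read from the Content-Type header of the response, since the headers object returned by urllib.request.urlopen provides get_content_charset(). The sketch below is an illustration only: the helper name decode_body and the utf-8 fallback are assumptions, and a plain email.message.Message stands in for conn.headers so that it runs without a network connection.

```python
from email.message import Message

def decode_body(raw, headers, fallback="utf-8"):
    """Decode raw bytes using the charset named in headers (hypothetical helper)."""
    # get_content_charset() returns None when the header names no charset.
    charset = headers.get_content_charset() or fallback
    # errors="replace" keeps the scan going if a page contains invalid bytes.
    return raw.decode(charset, errors="replace")

# Stand-in for conn.headers (an assumption for this offline example).
headers = Message()
headers["Content-Type"] = "text/html; charset=windows-1251"
text = decode_body("Привет".encode("windows-1251"), headers)
print(text == "Привет")
```

In the loop, conn.read().decode('windows-1251') would then become decode_body(conn.read(), conn.headers). Whether the forum actually sends a correct charset header is something to check before relying on this.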