Python：扫描网站的所有站点以获取特定的URL

Question

I want to scan my forum for specific links. 我想扫描论坛中的特定链接。 All links look like this: http://www.vbulletinxyz-forum.tld/forum/showthread.php?t=17590 . 所有链接看起来像这样： http://www.vbulletinxyz-forum.tld/forum/showthread.php?t=17590 。 Only the thread-number at the end of the link changes. 仅链接末尾的线程号更改。

Currently I am using the following code, but it only works for one specific URL, not all threads of the forum. 目前，我正在使用以下代码，但它仅适用于一个特定的URL，不适用于论坛的所有线程。 How would I have to change the code to let it scan all threads? 我将如何更改代码以使其扫描所有线程？

import urllib
mypath = "http://vbulletin-forumxyz.tld/forum/showthread.php?t=1"
mylines = urllib.urlopen(mypath).readlines()
for item in mylines:
    if "http://specific.tld" in item:
        print item[item.index("http://specific.tld"):]

Answer 1

either by trying all thread numbers 通过尝试所有线程号
or by using a spider which follows links (and discovers new threads) 或使用跟随链接的蜘蛛程序（并发现新线程）

(1) is easy to implement but probably not all thread numbers (t) are existent. （1）易于实现，但可能并非所有线程号（t）都存在。 So there will be a lot of 404 requests. 因此将有很多404请求。

(2) take a look at scrapy （2）看刮y

update (1): here is how it can be done in principle. 更新（1）：原则上可以这样做。 Note that a) the url you provided is not reachable (dummy) so i did not test it and b) its python 3.X 请注意，a）您提供的网址无法访问（虚拟），因此我没有对其进行测试，b）其python 3.X

import urllib.request
import time


def mypath(t):
    return "http://vbulletin-forumxyz.tld/forum/showthread.php?t={}".format(t)


for t in range(2):
    conn = urllib.request.urlopen(mypath(t))

    # check status code
    if conn.getcode() != 200:
        continue

    mylines = conn.read().decode('utf-8').splitlines()
    for item in mylines:
        if "http://specific.tld" in item:
            print(item)

   # avoid fetching to fast (you might get banned otherwise)
    time.sleep(0.5)

Answer 2

This is how it works and checks threads from 0 to 400,000. 这就是它的工作方式，并检查从0到400,000的线程。

import urllib.request
import time
import codecs

def mypath(t):
    return "http://www.someforum.org/forum/showthread.php?t={}".format(t)


for t in range(0,400000):
    conn = urllib.request.urlopen(mypath(t))

    # check status code
    if conn.getcode() != 200:
        continue

    mylines = conn.read().decode('windows-1251').splitlines()

    for item in mylines:
        if "http://someurl.tld" in item:
            print(item)

   # avoid fetching to fast (you might get banned otherwise)
   # time.sleep(0.5)

Python：扫描网站的所有站点以获取特定的URL

问题描述

2 个解决方案

解决方案1
1 2019-07-07 09:07:47

解决方案2
0 2019-07-07 13:24:33

Python：扫描网站的所有站点以获取特定的URL

问题描述

2 个解决方案

解决方案1 1 2019-07-07 09:07:47

解决方案2 0 2019-07-07 13:24:33

解决方案1
1 2019-07-07 09:07:47

解决方案2
0 2019-07-07 13:24:33