
Foreach loop to get next page of links with BeautifulSoup/Mechanize/Python

I have a view:

def Processinitialscan(request):
    EnteredDomain = request.GET.get('domainNm')

    #get raw output
    getDomainLinksFromGoo = settings.GOOGLE_BASEURL_FOR_HARVEST+settings.GOO_RESULT_DOMAIN_QUERIED+EnteredDomain
    rawGatheredGooOutput = mechanizeBrowser.open(getDomainLinksFromGoo)

    beautifulSoupObj = BeautifulSoup(mechanizeBrowser.response().read()) #read the raw response
    getFirstPageLinks = beautifulSoupObj.find_all('cite') #get first page of urls

    pattern = re.compile('^.*start=')   #set regex to search on - find anything like: " <domain and path here>start= "
    getRemainingPageUrls = beautifulSoupObj.find_all('a',attrs={'class': 'fl', 'href': pattern})

    NumberOfUrlsFound = len(getRemainingPageUrls)

    MaxUrlsToGather = ((NumberOfUrlsFound*10)+settings.GOOGLE_RESULT_AMT_ACCOUNT_FOR_PAGE_1) # +10 because 10 represents the urls on the first page

    url_data = UrlData(NumberOfUrlsFound, getDomainLinksFromGoo + '&')  # pass the base URL string, not the compiled regex
    #return HttpResponse(MaxUrlsToGather)

    return render(request, 'VA/scan/process_scan.html', {
        'url_data':url_data,'EnteredDomain':EnteredDomain,'getDomainLinksFromGoo':getDomainLinksFromGoo,
        'getRemainingPageUrls' : getRemainingPageUrls, 'NumberOfUrlsFound':NumberOfUrlsFound,
        'getFirstPageLinks' : getFirstPageLinks, 'MaxUrlsToGather' : MaxUrlsToGather
    })

and a template:

{% block block_containercontent %}
    {% autoescape on %}
    <h1>{{ EnteredDomain }}</h1>
<strong>url used: </strong>{{ getDomainLinksFromGoo }}<br />
<hr>
<br>
<strong>first page of links</strong> {{ getFirstPageLinks }}
<hr>
<br><strong>number of "next" links</strong> {{ NumberOfUrlsFound }}
<hr>
<br>
<strong>remaining urls:</strong> {{ getRemainingPageUrls }}
    {% if url_data.num_of_urls > 1 %}
    {% for url in url_data.url_list %}
        {{ url }}
    {% endfor %}
{% endif %}
    {% endautoescape %}

{% endblock block_containercontent %}

The template outputs:

url used: https://www.google.com/search?q=site%3Aasite.com


first page of links [<cite>www.google.com/webmasters/</cite>, <cite>www.asite.com</cite>, <cite>www.asite.com/blog/</cite>, <cite>www.asite.com/blog/projects/</cite>, <cite>www.asite.com/blog/category/internet/</cite>, <cite>www.asite.com/blog/category/goals/</cite>, <cite>www.asite.com/blog/category/uncategorized/</cite>, <cite>www.asite.com/blog/why-i-left-facebook/2013/01/</cite>, <cite>www.asite.com/blog/category/startups-2/</cite>, <cite>www.asite.com/blog/category/goals/</cite>, <cite>www.asite.com/blog/category/internet/</cite>]


number of "next" links 2

My question: how can I use NumberOfUrlsFound in a loop in the template to generate links such as /search?q=site:entereddomain.com&start=10 and /search?q=site:entereddomain.com&start=20, and then follow those links with BeautifulSoup, according to the value of NumberOfUrlsFound? So if NumberOfUrlsFound = 2, the URL search?q=site:asite.com&start=10 should be generated, and search?q=site:asite.com&start=20 as well. Further:

(pseudocode):

if(NumberOfUrlsFound > 1)
    foreach(NumberOfUrlsFound)
        # generate url with start=n+10  
        ## asite.com?/search?start=10  
        ## Then ...
        ## asite.com?/search?start=20
        ## and so on ..
        # where n represents the previous number
        # this n number is determined by `NumberOfUrlsFound` which might have a value of 2 for example
        # this value of 2 represents a max value of start=20 value to generate urls on.

You could create a data object that represents the data you want to display in the template.

class UrlData(object):
    def __init__(self, num_of_urls, url_pattern):
        self.num_of_urls = num_of_urls
        # Base search URL ending in '?' or '&', so that 'start=N' can be appended directly
        self.url_pattern = url_pattern

    def url_list(self):
        # Returns a list of strings that represent the urls you want based on num_of_urls
        # e.g. https://www.google.com/search?q=site%3Aasite.com&start=10
        # (range is used instead of xrange so this runs on Python 2 and 3)
        return [self.url_pattern + 'start=' + str((i + 1) * 10)
                for i in range(self.num_of_urls)]
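As a variation on the class above, the page URLs could also be built with the standard library's `urlencode` instead of string concatenation, which takes care of the `&` separator and of escaping the query. This is only a minimal sketch, assuming Python 3; the function name `build_page_urls` and its `page_size` parameter are hypothetical and not part of the original answer.

```python
from urllib.parse import urlencode


def build_page_urls(search_url, query, num_next_pages, page_size=10):
    """Return one URL per "next" results page.

    search_url is the search endpoint without a query string and
    query is the value for q= (both names are illustrative).
    """
    urls = []
    for page in range(1, num_next_pages + 1):
        # The first "next" page starts at result 10, the second at 20, and so on.
        params = urlencode({'q': query, 'start': page * page_size})
        urls.append(search_url + '?' + params)
    return urls


if __name__ == '__main__':
    for url in build_page_urls('https://www.google.com/search', 'site:asite.com', 2):
        print(url)
```

The returned list can be handed to the template in the same way as the list produced by `UrlData.url_list`.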

In your views.py:

# Create a UrlData object from NumberOfUrlsFound and a url_pattern
# url_pattern being the base search URL plus '&', e.g. https://www.google.com/search?q=site%3Aasite.com&
url_data = UrlData(NumberOfUrlsFound, getDomainLinksFromGoo + '&')

return render(request, template, {'url_data': url_data, ...})

Just create the object with the data in your view function, then pass that object to the template.

In the template, you can do the following:

{# Mirroring your check #}
{% if url_data.num_of_urls > 1 %}
    {# Iterate through the list returned by the url_list method defined in UrlData #}
    {% for url in url_data.url_list %}
         {{ url }} {# e.g. https://www.google.com/search?q=site%3Aasite.com&start=10 #}
    {% endfor %}
{% endif %}

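In the template, when you reference url_data.url_list, Django calls that method on the UrlData object for you and iterates over the list it returns.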
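The answer covers generating the URLs; the question also asks about following them. Since a template only renders data, the fetching would most likely belong in the view. The following is a rough sketch of that step, assuming Python 3. It uses the standard library's `HTMLParser` in place of BeautifulSoup so that it runs without extra packages, and the names `CiteCollector`, `collect_cites`, and `fetch` are hypothetical. In the question's view, `fetch` could wrap something like `mechanizeBrowser.open(url).read()`, and the parsing could stay as `BeautifulSoup(html).find_all('cite')`.

```python
from html.parser import HTMLParser


class CiteCollector(HTMLParser):
    """Collects the text inside every <cite> element of a page."""

    def __init__(self):
        super().__init__()
        self.cites = []
        self._in_cite = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == 'cite':
            self._in_cite = True
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is still part of the cite.
        if self._in_cite:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'cite' and self._in_cite:
            self.cites.append(''.join(self._buffer).strip())
            self._in_cite = False


def collect_cites(urls, fetch):
    """Fetch each URL with the supplied callable and gather all <cite> texts."""
    found = []
    for url in urls:
        parser = CiteCollector()
        parser.feed(fetch(url))  # fetch(url) must return the page HTML as a str
        parser.close()
        found.extend(parser.cites)
    return found


if __name__ == '__main__':
    # Canned pages stand in for real HTTP responses in this demo.
    pages = {
        'page2': '<div><cite>www.asite.com/blog/</cite></div>',
        'page3': '<div><cite>www.<b>asite</b>.com/about/</cite></div>',
    }
    print(collect_cites(['page2', 'page3'], pages.get))
```

Passing the fetch function in as an argument keeps the parsing logic testable without any network access.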


