
How to get the right source code with Python from the URLs using my web crawler?

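I am trying to write a web crawler in Python, using the re and requests modules. I want to get the URLs from the first page (it is a forum) and then get information from each URL.

My problem now is that I have already stored the URLs in a list, but I cannot get any further: fetching those URLs does not give me the right source code.

Here is my code: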

import re
import requests

url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'

sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
for eachLink in allLinksinPage:
    url = 'http://bbs.skykiwi.com/' + eachLink.encode('utf-8')
    html = getsourse(url) #THIS IS WHERE I CAN'T GET THE RIGHT SOURCE CODE


#To get the source code of current url
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows  NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.text

#To get all the links in current page
def getallLinksinPage(sourceCode):
    bigClasses = re.findall('<th class="new">(.*?)</th>', sourceCode, re.S)
    allLinks = []
    for each in bigClasses:
        everylink = re.findall('</em><a href="(.*?)" onclick', each, re.S)[0]
        allLinks.append(everylink)
    return allLinks

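You are defining your functions after you use them, so your code will raise an error. You also should not be using re to parse HTML; use a parser such as BeautifulSoup, as below. Also use urlparse.urljoin (urllib.parse.urljoin on Python 3) to join the base URL to the links. What you actually want are the hrefs of the anchor tags inside the div with the id threadlist: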

import requests
from bs4 import BeautifulSoup

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'


# To get the source code of the current url
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows  NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.content


# To get all the links in the current page
def getallLinksinPage(sourceCode):
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(sourceCode, "html.parser")
    return [a["href"] for a in soup.select("#threadlist a.xst")]


# The functions are defined above, so they can be called from here on
sourceCode = getsourse(url)  # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode)  # a list of the urls in the current page
for eachLink in allLinksinPage:
    url = 'http://bbs.skykiwi.com/'
    html = getsourse(urljoin(url, eachLink))
    print(html)
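
As a small offline illustration of the urljoin step, the sketch below resolves a few href shapes against the forum's base URL. The helper name absolutize, the leading-slash variant and the example.com address are made up for this illustration, and the import shown is the Python 3 one (urllib.parse) rather than the Python 2 urlparse module.

```python
from urllib.parse import urljoin  # Python 3 location of urljoin

BASE = 'http://bbs.skykiwi.com/'


def absolutize(href, base=BASE):
    """Resolve an href found in a page against the base URL (hypothetical helper)."""
    return urljoin(base, href)


# A relative href, as found in the thread list
relative = absolutize('forum.php?mod=viewthread&tid=3177846')
# A root-relative href resolves to the same place for this base
rooted = absolutize('/forum.php?mod=viewthread&tid=3177846')
# An href that is already absolute is returned unchanged
external = absolutize('http://example.com/page')

print(relative)
print(rooted)
print(external)
```

Plain string concatenation only handles the first of these three cases reliably, which is the reason to prefer urljoin.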

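If you print urljoin(url, eachLink) in the loop, you will see that you get all the correct links for the table and that the correct source code is returned. Below is a snippet of the links returned: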

http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3177846&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3197510&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3201399&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3170748&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3152747&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3168498&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3176639&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203657&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3190138&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3140191&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199154&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3156814&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203435&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3089967&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199384&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3173489&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3204107&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231

If you visit the links above in your browser, you will get the correct page. Using a link that still contains &amp;, such as http://bbs.skykiwi.com/forum.php?mod=viewthread&amp;tid=3187289&amp;extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231, you will see:

Sorry, specified thread does not exist or has been deleted or is being reviewed
[New Zealand day-dimensional network Community Home]

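You can clearly see the difference in the URLs: the regex copies the HTML entity &amp; out of the raw markup verbatim, whereas the working links listed above contain a plain &. If you want your own code to work, you need to do the replacement before applying the regex:

 everylink = re.findall('</em><a href="(.*?)" onclick', each.replace("&amp;", "&"), re.S)[0]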
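
Another way to get a usable URL out of the regex approach, sketched here assuming Python 3, is to decode the HTML entities in the captured href with html.unescape from the standard library. The sample markup is a hand-written fragment shaped to match the question's regexes, not a copy of the real page.

```python
import re
from html import unescape

# Hand-written fragment for illustration only; the real page has more markup.
sample = ('<th class="new"><em>[tag]</em>'
          '<a href="forum.php?mod=viewthread&amp;tid=3177846&amp;extra=page%3D1" '
          'onclick="return true" class="s xst">title</a></th>')

# Same two-step regex extraction as in the question
cell = re.findall('<th class="new">(.*?)</th>', sample, re.S)[0]
raw = re.findall('</em><a href="(.*?)" onclick', cell, re.S)[0]

# The regex returns the attribute exactly as written in the markup,
# so entities such as &amp; still need to be decoded.
decoded = unescape(raw)

print(raw)
print(decoded)
```

Percent-escapes such as %3D are left alone by unescape; only HTML entities are decoded.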

But really, you should not be parsing HTML with a regex.
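
To make that last point concrete without installing anything, here is a rough standard-library sketch (Python 3 assumed) that uses html.parser.HTMLParser in place of BeautifulSoup. ThreadLinkParser and the sample markup are hypothetical, invented for this illustration; the only detail borrowed from the answer is the xst class on thread links. Unlike the CSS selector in the answer, it does not check that the link sits inside the threadlist container.

```python
from html.parser import HTMLParser


class ThreadLinkParser(HTMLParser):
    """Collect href values of <a> tags carrying the 'xst' class (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attributes = dict(attrs)
        classes = (attributes.get('class') or '').split()
        if 'xst' in classes and attributes.get('href'):
            # Attribute values arrive with entities already decoded
            self.links.append(attributes['href'])


# Hand-written fragment for illustration only
sample = ('<div id="threadlist"><table><tr><th class="new">'
          '<a href="forum.php?mod=viewthread&amp;tid=3177846&amp;extra=page%3D1" '
          'class="s xst">title</a></th></tr></table></div>')

parser = ThreadLinkParser()
parser.feed(sample)
parser.close()
links = parser.links
print(links)
```

Because the parser decodes &amp; while reading the attribute, the collected href can be passed straight to urljoin with no manual replacement.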

