Want to get all links in a webpage using urllib.request
When I test it, it keeps printing out (None, 0), even though the page at the url I used has several <a href= tags:
import urllib.request as ur

def getNextlink(url):
    sourceFile = ur.urlopen(url)
    sourceText = sourceFile.read()
    page = str(sourceText)
    startLink = page.find('<a href=')
    if startLink == -1:
        return None, 0
    startQu = page.find('"', startLink)
    endQu = page.find('"', startQu+1)
    url = page[startQu+1:endQu]
    return url, endQu
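A note on why the snippet above can print (None, 0): read() returns bytes, and wrapping them in str() produces a literal like "b'...'" full of escape sequences; the page may also open its anchors with other attributes before href (e.g. <a class="x" href=...>), so the exact text '<a href=' never occurs. Below is a minimal sketch along the same find()-based lines that decodes the bytes first and searches for 'href="' instead — the helper names are my own, not from the question:

```python
import urllib.request as ur

def find_next_link(page, start=0):
    """Return (url, end_index) for the next href in decoded page text,
    or (None, 0) when no further link is found."""
    # Search for 'href="' rather than '<a href=' so anchors that carry
    # other attributes before href still match.
    pos = page.find('href="', start)
    if pos == -1:
        return None, 0
    start_qu = pos + len('href="')
    end_qu = page.find('"', start_qu)
    return page[start_qu:end_qu], end_qu

def get_all_links(url):
    # Decode the response bytes instead of calling str(bytes), which
    # yields a "b'...'" literal rather than the page text.
    with ur.urlopen(url) as source:
        page = source.read().decode('utf-8', errors='replace')
    links, pos = [], 0
    while True:
        link, pos = find_next_link(page, pos)
        if link is None:
            return links
        links.append(link)
```

This sketch only handles double-quoted href values; single-quoted ones would need the same treatment.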
You should use BeautifulSoup instead; it works smoothly together with requests for your requirement. I will give an example below:
from bs4 import BeautifulSoup
import requests

def links(url):
    html = requests.get(url).content
    bsObj = BeautifulSoup(html, 'lxml')
    links = bsObj.findAll('a')
    finalLinks = set()
    for link in links:
        if 'href' in link.attrs:  # skip anchors without an href
            finalLinks.add(link.attrs['href'])
    return finalLinks  # the original snippet never returned the collected links
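One gotcha with the set collected above: href values are often relative. A hedged sketch (the function name and sample URLs are my own) that resolves them against the page URL with urllib.parse.urljoin:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def absolute_links(base_url, html):
    """Collect every href on the page, resolved against the page URL."""
    soup = BeautifulSoup(html, 'html.parser')  # stdlib parser, no lxml needed
    # href=True skips anchors without an href, which would otherwise KeyError
    return {urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)}
```

urljoin leaves already-absolute links untouched and resolves relative ones like /about against the base URL.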
Try this:
import urllib.request
import re

url = ''  # pass any url here
# note: inside a raw string, \s (not \\s) matches whitespace
urllist = re.findall(r"""<\s*a\s+href=["']([^'"]+)["']""", urllib.request.urlopen(url).read().decode("utf-8"))
print(urllist)
Here is another solution:
from urllib.request import urlopen

url = ''
html = str(urlopen(url).read())
for i in range(len(html) - 3):
    if html[i] == '<' and html[i+1] == 'a' and html[i+2] == ' ':
        pos = html[i:].find('</a>')
        print(html[i: i+pos+4])
Define your url. Hope this helps, and don't forget to upvote and accept.
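If installing third-party packages is not an option, the standard library's html.parser module can do the same scan more reliably than slicing the raw string character by character; a sketch (the class name is my own):

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives already parsed into (name, value) pairs,
        # so quoting style and attribute order no longer matter
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def get_links(url):
    html = urlopen(url).read().decode('utf-8', errors='replace')
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```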
How about one of these solutions?
import requests
from bs4 import BeautifulSoup

research_later = "giraffe"
goog_search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + research_later
r = requests.get(goog_search)
print(r)
soup = BeautifulSoup(r.text, "html.parser")
print(soup)
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.flashscore.com/soccer/netherlands/eredivisie/results/")
soup = BeautifulSoup(r.content, "html.parser")  # name a parser explicitly to avoid a warning
htmltext = soup.prettify()
print(htmltext)
import requests
from bs4 import BeautifulSoup

url = "http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
maindiv = soup.find_all("div", {"class": "text-center"})
for div in maindiv:
    print(div.text)
Sometimes BeautifulSoup and requests are not what you want to use. In some cases the website in question will block requests-based scraping (you get a 403 response), so you have to use urllib.request instead.
Here is how you can get all the links (hrefs) listed on a webpage you are trying to scrape with urllib.request.
import urllib.request
from urllib.request import Request, urlopen
import re

# get the full html code from a website (a User-Agent header helps avoid 403s)
req = Request('https://www.your_url.com', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req)
print(webpage.read())

# create a list of all links/href tags
url = 'https://www.your_url.com'
urllist = re.findall(r"href=[\"'](.*?)[\"']", urllib.request.urlopen(url).read().decode("utf-8"))
print(urllist)

# print each link on a separate line
for elem in urllist:
    print(elem)
In the code we use bytes.decode(x) with the chosen text encoding x to convert the HTML response into a plain string. The standard encoding is utf-8; you may need to change it if the website you are trying to scrape uses a different one.
We find the links with the help of a regular expression: calling re.findall(pattern, string) with the pattern href=["'](.*?)["'] on the decoded string matches every href attribute but captures only the url text that follows in quotation marks, returning a list of the links contained inside href tags.
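To make the behaviour of that pattern concrete, here is the same re.findall call run on a small made-up HTML snippet:

```python
import re

# a tiny snippet for illustration; one double-quoted and one single-quoted href
html = '<p>intro</p><a href="/first">one</a> <a href=\'/second\'>two</a>'
urllist = re.findall(r"href=[\"'](.*?)[\"']", html)
# urllist is now ['/first', '/second']
```

The character class ["'] lets the pattern match either quoting style, and the non-greedy (.*?) stops at the first closing quote.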
Try it with requests-html, which can parse the HTML so we can search for any tag, class or ID:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get(url)
r.html.links
If you want the absolute links, use:
r.html.absolute_links