
Error while crawling the web with Python

When I try to run the code below, the following error is returned. I would much appreciate it if someone could point out what I did wrong. Thank you.

Traceback (most recent call last):
  File "web_crawler.py", line 26, in <module>
    links = get_all_links(page)
  File "web_crawler.py", line 14, in get_all_links
    url, endpos = get_next_target(page)
  File "web_crawler.py", line 2, in get_next_target
    start_link = page.find("<a href=")
TypeError: a bytes-like object is required, not 'str'

def get_next_target(page):
    start_link = page.find("<a href=")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"',start_link)
    end_quote = page.find('"',start_quote+1)
    url = page[start_quote+1:end_quote]
    print(url)
    return url, end_quote

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

import requests
url='https://en.wikipedia.org/wiki/Moon'
r = requests.get(url)
page = r.content
links = get_all_links(page)

response.content is the raw body of the response. It is not decoded in any way; it is just the raw bytes.

What you want to use instead is the response.text attribute, which contains the decoded content as a string.
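
As a minimal sketch of the difference, runnable without network access: the bytes literal `raw` below is a made-up stand-in for what `r.content` returns, and `text` is a stand-in for `r.text` (UTF-8 is assumed here; requests picks the encoding itself). In the script from the question, the fix would presumably be to change `page = r.content` to `page = r.text`.

```python
# Hypothetical stand-in for r.content: requests returns the body as bytes.
raw = b'<a href="https://en.wikipedia.org/wiki/Earth">Earth</a>'

# Searching bytes with a str pattern raises TypeError, as in the traceback.
try:
    raw.find("<a href=")
    raised = False
except TypeError:
    raised = True

# Stand-in for r.text: the same body decoded to str (UTF-8 assumed).
text = raw.decode("utf-8")
start_link = text.find("<a href=")

print(raised)      # True
print(start_link)  # 0
```

Once `page` is a str, the `page.find("<a href=")` calls in get_next_target work unchanged.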

(You probably also want to use an HTML parsing library such as BeautifulSoup instead of your current page.find approach.)
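
To illustrate the parser-based approach, here is a rough sketch that uses the standard library's `html.parser` module in place of BeautifulSoup, so that it runs without extra installs. The `LinkCollector` class and the `sample` markup are hypothetical names introduced for this example; `get_all_links` keeps the name from the question but is reimplemented.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def get_all_links(page):
    # page must be a str, e.g. r.text rather than r.content.
    parser = LinkCollector()
    parser.feed(page)
    parser.close()
    return parser.links


# Made-up sample markup; the second link has an extra attribute and
# single quotes, which the plain page.find approach would mishandle.
sample = (
    '<p><a href="/wiki/Earth">Earth</a> and '
    "<a class=\"x\" href='/wiki/Sun'>Sun</a></p>"
)
links = get_all_links(sample)
print(links)  # ['/wiki/Earth', '/wiki/Sun']
```

Unlike the string search, a parser does not depend on `href` being the first attribute or on the quoting style.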
