
BeautifulSoup doesn't seem to parse anything

I've been trying to learn BeautifulSoup by making myself a proxy scraper, and I've run into a problem. BeautifulSoup seems unable to find anything, and when I print what it parses, it shows me this:

<html>
 <head>
 </head>
 <body>
  <bound 0x7f977c9121d0="" <http.client.httpresponse="" at="" httpresponse.read="" method="" object="" of="">
&gt;
  </bound>
 </body>
</html>

I have tried changing the website I parse and the parser itself (lxml, html.parser, html5lib), but no matter what I do I get the exact same result. Here's my code. Can anyone explain what's wrong?

from bs4 import BeautifulSoup
import urllib
import html5lib

class Websites:

    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")

        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req).read
        soup = BeautifulSoup(str(content), "html5lib")

        print("Connected. Loading the page ...")

        print("Print page")
        print("")
        print(soup.prettify())

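You are referencing urllib.request.urlopen(req).read without calling it, so str(content) is the text representation of the bound method rather than the page contents. The correct syntax is urllib.request.urlopen(req).read(). You are also not closing the connection; the code below fixes that too.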
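To see why the output in the question contains a method description instead of HTML, here is a minimal sketch using only the standard library. io.BytesIO stands in for the HTTP response (an assumption, so that the snippet runs without a network connection), and the variable names are illustrative.

```python
import io

# Stand-in for the HTTP response object; a real response exposes read() the same way.
buf = io.BytesIO(b"<p>hello</p>")

# Without parentheses, read is never called: str() describes the method object itself.
method_text = str(buf.read)

# With parentheses the bytes are read, but str() on bytes keeps the b'...' wrapper.
bytes_text = str(buf.read())

# Decoding (after rewinding the buffer) yields the actual markup as text.
buf.seek(0)
decoded_text = buf.read().decode("utf-8")

print(method_text)
print(bytes_text)
print(decoded_text)
```

The second case shows that wrapping the result of read() in str() is also unhelpful. BeautifulSoup accepts bytes directly, so the str() call can simply be dropped.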

A better way to open connections is the with urllib.request.urlopen(url) as req: syntax, since it closes the connection for you.
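A minimal sketch of the with form might look like the following. The helper name fetch_html and its parameters are hypothetical and chosen for illustration only; it returns the raw bytes and leaves parsing to the caller.

```python
import urllib.request


def fetch_html(url, headers=None):
    """Return the response body as bytes (hypothetical helper, not from the original post)."""
    req = urllib.request.Request(url, None, headers or {})
    # The with block closes the connection on exit, even if read() raises.
    with urllib.request.urlopen(req) as response:
        return response.read()
```

Inside the class from the question, the result could then be passed straight to the parser, for example BeautifulSoup(fetch_html(url, self.header), "html5lib"), with no explicit close() call needed.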

from bs4 import BeautifulSoup
import urllib.request  # "import urllib" alone does not reliably load the request submodule

class Websites:

    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")

        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req)
        html = content.read()  # note the parentheses: this returns the page as bytes
        content.close()  # Important to close the connection

        # Pass the bytes directly; str(html) would produce the literal "b'...'" text.
        # The html5lib package must be installed for this parser to be available.
        soup = BeautifulSoup(html, "html5lib")

        print("Connected. Loading the page ...")

        print("Print page")
        print("")
        print(soup.prettify())

For more info see: https://docs.python.org/3.0/library/urllib.request.html#examples

