
Parsing HTML with requests and BeautifulSoup

I'm not sure if I'm approaching this correctly. I'm using requests to make a GET:

con = s.get(url)

When I call con.content, the whole page is there. But when I pass con.content into BeautifulSoup:

soup = BeautifulSoup(con.content)
print(soup.a)

I get None. There are lots of tags in there, not behind any JavaScript, that are present when I call con.content, but when I try to parse with BeautifulSoup most of the page is not there.

Change the parser to html5lib

pip install html5lib

And then,

soup = BeautifulSoup(con.content, 'html5lib')
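
Putting it together, a minimal sketch of the fix (assuming the url and session s from the question):

import requests
from bs4 import BeautifulSoup

s = requests.Session()
con = s.get(url)  # url is the page from the question

# html5lib builds the tree the way a browser would, which makes it
# the most forgiving choice for badly-formed markup
soup = BeautifulSoup(con.content, 'html5lib')
print(soup.a)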

The a tags are probably not at the top level.

soup.find_all('a')

is probably what you wanted.
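
To illustrate the difference, a quick sketch, assuming soup was built from the same response:

first_link = soup.find('a')     # first <a> anywhere in the tree; same as soup.a
all_links = soup.find_all('a')  # list of every <a> in the document

for link in all_links:
    print(link.get('href'))    # href attribute, or None if absent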

In general, I have found lxml to be more reliable, more consistent in its API, and faster. Yes, even more reliable: I have repeatedly had documents where BeautifulSoup failed to parse them, but lxml in its robust mode, lxml.html.soupparser, still worked well. And there is the lxml.etree API, which is really easy to use.
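
For reference, a rough sketch of both approaches; the URL and variable names here are placeholders, not from the original question:

import requests
import lxml.html
from lxml.html import soupparser

con = requests.get('https://news.ycombinator.com/')

# Standard lxml.html parse, queried with XPath
tree = lxml.html.fromstring(con.content)
hrefs = tree.xpath('//a/@href')

# Robust fallback for broken markup: parse via BeautifulSoup
# but get an lxml tree back
tree = soupparser.fromstring(con.text)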

Without being able to see the HTML you're getting, I just tried this on the Hacker News site and it returns all the a tags as expected.

import requests
from bs4 import BeautifulSoup

s = requests.Session()

con = s.get('https://news.ycombinator.com/')

# Pass an explicit parser to avoid the "no parser specified" warning
soup = BeautifulSoup(con.text, 'html.parser')

# find_all is the bs4 name; findAll is the legacy alias
links = soup.find_all('a')

for link in links:
    print(link)
