
Parsing HTML with requests and BeautifulSoup

I'm not sure if I'm approaching this correctly. I'm using requests to make a GET:

con = s.get(url)

When I call con.content, the whole page is there. But when I pass con.content into BeautifulSoup:

soup = BeautifulSoup(con.content)
print(soup.a)

I get None. There are lots of tags in there, not behind any JavaScript, that are present when I call con.content, but when I try to parse with BeautifulSoup most of the page is not there.

Change the parser to html5lib

pip install html5lib

And then,

soup = BeautifulSoup(con.content, 'html5lib')
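
Putting it together, a minimal sketch of the fix (assuming the url and session s from the question):

import requests
from bs4 import BeautifulSoup

s = requests.Session()
con = s.get(url)  # url is the page from the question

# html5lib builds the tree the way a browser would, which makes it
# the most forgiving choice for badly-formed markup
soup = BeautifulSoup(con.content, 'html5lib')
print(soup.a)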

The a tags are probably not at the top level.

soup.find_all('a')

is probably what you wanted.
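
To illustrate the difference, a quick sketch, assuming soup was built from the same response:

first_link = soup.find('a')     # first <a> anywhere in the tree; same as soup.a
all_links = soup.find_all('a')  # list of every <a> in the document

for link in all_links:
    print(link.get('href'))    # href attribute, or None if absent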

In general, I have found lxml to be more reliable, more consistent in its API, and faster. Yes, even more reliable: I have repeatedly had documents where BeautifulSoup failed to parse them, but lxml in its robust mode, lxml.html.soupparser, still worked well. And there is the lxml.etree API, which is really easy to use.
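
For reference, a rough sketch of both approaches; the URL and variable names here are placeholders, not from the original question:

import requests
import lxml.html
from lxml.html import soupparser

con = requests.get('https://news.ycombinator.com/')

# Standard lxml.html parse, queried with XPath
tree = lxml.html.fromstring(con.content)
hrefs = tree.xpath('//a/@href')

# Robust fallback for broken markup: parse via BeautifulSoup
# but get an lxml tree back
tree = soupparser.fromstring(con.text)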

Without being able to see the HTML you're getting, I just tried this on the Hacker News site and it returns all the a tags as expected.

import requests
from bs4 import BeautifulSoup

s = requests.Session()

con = s.get('https://news.ycombinator.com/')

# Pass an explicit parser to avoid the "no parser specified" warning
soup = BeautifulSoup(con.text, 'html.parser')

# find_all is the bs4 name; findAll is the legacy alias
links = soup.find_all('a')

for link in links:
    print(link)
