
BeautifulSoup findAll returns empty list when selecting class

findAll() returns an empty list when I specify a class.

Specifying only the tag works fine.

import urllib2
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/Showerthoughts/top/?sort=top&t=week"

hdr = { 'User-Agent' : 'tempro' }
req = urllib2.Request(url, headers=hdr)
htmlpage = urllib2.urlopen(req).read()

BeautifulSoupFormat = BeautifulSoup(htmlpage,'lxml')
name_box = BeautifulSoupFormat.findAll("a",{'class':'title'})

for data in name_box:
    print(data.text)

I'm trying to get only the text of each post. The current code prints nothing. If I remove {'class':'title'}, it prints the post text as well as the username and comments of each post, which I don't want.

I'm using Python 2 with the latest versions of BeautifulSoup and urllib2.

To get all the comments you will need a tool such as Selenium, which lets you scroll the page. Without that, to get just the initial results, you can extract them from a script tag in the requests response:

import json
import re

import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.reddit.com/r/Showerthoughts/top/?sort=top&t=week', headers=headers)
soup = bs(r.content, 'lxml')

# The initial post data is embedded as JSON in the element with id="data".
script = soup.select_one('#data').text
# Escape the dot so it matches a literal "." in "window.___r".
p = re.compile(r'window\.___r = (.*); window')
data = json.loads(p.findall(script)[0])

# data['posts']['models'] maps each post ID to its post object.
for post in data['posts']['models'].values():
    print(post['title'])
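The extraction step in that answer can be tried offline. Below is a minimal sketch that runs the same regex-plus-JSON idea on a made-up string. The variable sample_script and its contents are hypothetical; they only imitate the shape the answer's code expects (posts, then models, then a title per post ID) and are not real Reddit output. The sketch also uses a non-greedy group and re.search, which is a small change from the answer's pattern, so that a missing match does not raise an IndexError.

```python
import json
import re

# Hypothetical stand-in for the text of the script tag; not real Reddit data.
sample_script = (
    'window.___r = {"posts": {"models": {'
    '"t3_a": {"title": "First sample title"}, '
    '"t3_b": {"title": "Second sample title"}}}}; '
    'window.___prefetches = [];'
)

# Capture the JSON object assigned to window.___r, stopping at the first "; window".
pattern = re.compile(r"window\.___r = (.*?); window")
match = pattern.search(sample_script)

titles = []
if match:
    data = json.loads(match.group(1))
    # models maps post IDs to post objects, so iterate over the values.
    titles = [post["title"] for post in data["posts"]["models"].values()]

for title in titles:
    print(title)
```

If the page layout changes and the pattern no longer matches, titles simply stays empty instead of the script crashing.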

The selector you are using does not match anything, because the links for those posts do not have class="title". Try this instead:

name_box = BeautifulSoupFormat.select('a[data-click-id="body"] > h2')

This selects every <h2> that is a direct child of an <a data-click-id="body"> element. Those <h2> tags contain the post text you need.

You can read more about selectors in BeautifulSoup here: ( https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors )
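To see the difference between the two lookups without fetching the live page, here is a small sketch on invented markup. The sample_html string is an assumption that only imitates the structure described above (links carrying data-click-id="body" that wrap an <h2>); the real page may differ. It uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Invented markup imitating the structure described above; not real Reddit HTML.
sample_html = """
<div>
  <a data-click-id="body" href="/r/example/1"><h2>First sample title</h2></a>
  <a data-click-id="comments" href="/r/example/1#c">12 comments</a>
  <a data-click-id="body" href="/r/example/2"><h2>Second sample title</h2></a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# No <a> has class="title", so the class-based lookup finds nothing.
by_class = soup.find_all("a", {"class": "title"})

# The attribute selector matches only <h2> elements directly inside the post links.
by_selector = soup.select('a[data-click-id="body"] > h2')
post_titles = [h2.get_text(strip=True) for h2 in by_selector]

print(len(by_class))
print(post_titles)
```

The comments link is skipped because its data-click-id value is different and it contains no <h2>.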

