
Why am I not able to scrape to the end of this HTML using BeautifulSoup?

I would like to scrape all the comments on this Reddit page and then write the data to a CSV file.

I noticed that all the comments are 'div' elements with the class 'RichTextJSON-root'.

Here is the code I have written:

from bs4 import BeautifulSoup
import requests
import csv

# Reddit

source = requests.get(
    'https://www.reddit.com/r/sysadmin/comments/gjkqvj/who_has_made_the_switch_from_dell_to_lenovo/').text

soup = BeautifulSoup(source, 'html.parser')

countreddit = 0

csv_file = open('reddit_scrape.csv', 'w', newline='')  # newline='' avoids blank rows on Windows
writer = csv.writer(csv_file)
writer.writerow(['Comment'])

comments = soup.find_all('div', class_='RichTextJSON-root')

for comment in comments:
    for para in comment.find_all('p'):
        paratext = para.text
        writer.writerow([paratext])
        countreddit += 1

print(f'Reddit Count: {countreddit}')
csv_file.close()

Link to reddit post: https://www.reddit.com/r/sysadmin/comments/gjkqvj/who_has_made_the_switch_from_dell_to_lenovo/

However, I have only managed to scrape up to about the middle of the page, stopping at the comment 'Those using the WD15 docks would like to have a word with you.' How can I scrape all the way to the end of the page?

I read in other posts that this could be because BeautifulSoup only parses the HTML the server initially returns, and at that point the page has not been fully rendered: the remaining comments are loaded later by JavaScript, which requests does not execute.
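Those posts suggested rendering the page in a real browser first, for example with Selenium, and only then handing the HTML to BeautifulSoup. A minimal sketch of that idea (assuming Selenium and a Chrome driver are installed; I have not verified that this loads every comment, since Reddit may still require scrolling or clicking "load more"):

from bs4 import BeautifulSoup
from selenium import webdriver

# Let a real browser execute the page's JavaScript, then grab the rendered HTML
driver = webdriver.Chrome()
driver.get('https://www.reddit.com/r/sysadmin/comments/gjkqvj/who_has_made_the_switch_from_dell_to_lenovo/')
rendered_html = driver.page_source
driver.quit()

soup = BeautifulSoup(rendered_html, 'html.parser')
comments = soup.find_all('div', class_='RichTextJSON-root')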

Thank you!

This URL will give you all the details (including the comments) in JSON format:

https://www.reddit.com/r/sysadmin/comments/gjkqvj.json

I think you could parse that JSON file and extract the comments.

import requests

url = 'https://www.reddit.com/r/sysadmin/comments/gjkqvj.json'
# Reddit tends to reject the default requests User-Agent, so send a browser-like one
headers = {'User-Agent': 'Mozilla/5.0'}

resp = requests.get(url, headers=headers)
json_string = resp.json()
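For example, the .json endpoint usually returns a list of two listings - the first describes the post, the second holds the comment tree - and each top-level comment's text sits under data.body. A sketch based on that shape (nested replies would need recursion, as the answer below shows):

# json_string is the parsed response from the snippet above.
# json_string[1] is the comments listing; 't1' entries are comments.
for child in json_string[1]['data']['children']:
    if child['kind'] == 't1':
        print(child['data']['body'])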

You can use Reddit's JSON feature - load the page in JSON format (just add .json to the end of the URL):

import json
import requests

url = "https://old.reddit.com/r/sysadmin/comments/gjkqvj/who_has_made_the_switch_from_dell_to_lenovo/.json"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}
data = requests.get(url, headers=headers).json()

#uncomment this to print all data:
#print(json.dumps(data, indent=4))

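# Recursively walk the parsed JSON and yield the value of every "body"
# key - in Reddit's comment JSON, "body" holds each comment's text.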
def get_comments(d):
    if isinstance(d, dict):
        for k, v in d.items():
            if k == "body":
                yield v
            else:
                yield from get_comments(v)
    elif isinstance(d, list):
        for v in d:
            yield from get_comments(v)


for c in get_comments(data):
    print(c)
    print("-" * 80)

Prints:

And all that sucked was the clunkpad. Replace with a 50 series trackpad and you’re good to go.
--------------------------------------------------------------------------------
except the 4th-generation core series. Those models don't exist.
--------------------------------------------------------------------------------
They've been good...forever
--------------------------------------------------------------------------------
Well, a little bit. But of course, there is always a room at the bottom (looking at a half eaten Apple).
--------------------------------------------------------------------------------
Yes, they sure did, enough people complained and they learned their lesson. At least they listen, they *could* be Apple. ;)
--------------------------------------------------------------------------------
Well, they did with the 440/540 series when they slashed all Thinkpad features (bento box, leds, Thinklight, replacing the mousepad with this stupid "glass touchpad") away to make those laptops look more "mainstream". A few series after they had to bring back a few features because they went too far.
--------------------------------------------------------------------------------

...
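To get back to the original goal of writing the comments to a CSV file, the get_comments generator above can feed csv.writer directly (a sketch reusing data from the snippet above):

import csv

# Write every comment body yielded by get_comments into one CSV column
with open('reddit_scrape.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Comment'])
    for comment in get_comments(data):
        writer.writerow([comment])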
