What is the best way to fetch all data (posts and their comments) from Reddit?

I need to analyze all the comments in a subreddit, e.g. r/dogs, from 2015 onwards. I want to fetch the JSON and store it in MongoDB so that I can parse the data whenever required. I checked PRAW, but I found no option to get all the comments together as JSON. I could only find ways to get top submissions.

Is there any way to code this?

Thanks!

You can use the Python Pushshift.io API Wrapper (PSAW) to get submissions and comments from a specific subreddit. It also supports more complex queries, such as searching for specific text inside a comment. The docs are available here.

For example, you can use the search_submissions() function to get up to 1000 submissions posted to r/dogs since the start of 2015:

import datetime as dt
import praw
from psaw import PushshiftAPI

r = praw.Reddit(...)
api = PushshiftAPI(r)

# Could be any date
start_epoch = int(dt.datetime(2015, 1, 1).timestamp())

# search_submissions() returns a generator object
submissions_generator = api.search_submissions(after=start_epoch, subreddit='dogs', limit=1000)

# You can then use this list, store it in MongoDB, etc.
submissions = list(submissions_generator)

Alternatively, to get up to 1000 comments posted to r/dogs since the start of 2015, you can use the search_comments() function:

# Could be any date
start_epoch = int(dt.datetime(2015, 1, 1).timestamp())

# search_comments() returns a generator object
comments_generator = api.search_comments(after=start_epoch, subreddit='dogs', limit=1000)
comments = list(comments_generator)
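Since the question asks for everything from 2015 onwards rather than a single batch of 1000, one option is to split the date range into smaller windows and run one query per window. The following is only a rough sketch: month_windows is a hypothetical helper (not part of PSAW), it uses UTC timestamps as an assumption, and it relies on the Pushshift before parameter alongside after.

```python
import datetime as dt


def month_windows(start, end):
    """Yield (after_epoch, before_epoch) pairs covering start..end month by month."""
    current = start
    while current < end:
        # Move to the first day of the following month.
        if current.month == 12:
            nxt = current.replace(year=current.year + 1, month=1, day=1)
        else:
            nxt = current.replace(month=current.month + 1, day=1)
        # Never run past the requested end date.
        nxt = min(nxt, end)
        yield int(current.timestamp()), int(nxt.timestamp())
        current = nxt


# Assumption: timestamps are taken in UTC so the windows do not depend on the local time zone.
start = dt.datetime(2015, 1, 1, tzinfo=dt.timezone.utc)
end = dt.datetime(2015, 4, 1, tzinfo=dt.timezone.utc)
windows = list(month_windows(start, end))

for after_epoch, before_epoch in windows:
    print(after_epoch, before_epoch)
    # Each pair could then be passed to a query such as:
    # api.search_comments(after=after_epoch, before=before_epoch, subreddit='dogs')
```

Querying per window keeps each batch small, so a failed run can resume from the last finished month. Items that fall exactly on a window boundary may need de-duplicating by id.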

Because the PushshiftAPI object is constructed with a PRAW Reddit instance here, PSAW returns PRAW objects for submissions and comments, which may be handy.
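PRAW objects are not directly JSON-serializable, so before storing them in MongoDB you would probably want to copy the fields you care about into plain dicts. Below is a minimal sketch of that step. The to_document helper and the field list are illustrative choices, not part of PRAW or PSAW, and a stand-in object is used so the example runs without Reddit credentials.

```python
import json
from types import SimpleNamespace

# Attribute names that PRAW exposes on Submission objects; adjust as needed.
SUBMISSION_FIELDS = ("title", "selftext", "created_utc", "score", "num_comments", "permalink")


def to_document(item, fields):
    """Copy selected attributes of a PRAW-like object into a JSON-friendly dict."""
    doc = {name: getattr(item, name, None) for name in fields}
    # Assumption: reuse the Reddit id as MongoDB's _id so re-inserting the same item is detectable.
    doc["_id"] = item.id
    # author is an object (or None for deleted accounts), so store its string form.
    author = getattr(item, "author", None)
    doc["author"] = str(author) if author is not None else None
    return doc


# Stand-in for a PRAW Submission, used only so this sketch is runnable on its own.
fake_submission = SimpleNamespace(
    id="abc123",
    title="Example title",
    selftext="Example body",
    created_utc=1420070400.0,
    score=10,
    num_comments=2,
    permalink="/r/dogs/comments/abc123/example_title/",
    author="some_user",
)

doc = to_document(fake_submission, SUBMISSION_FIELDS)
print(json.dumps(doc, indent=2))
```

With pymongo, a list of such dicts could then be passed to a collection's insert_many() method. For comments, the same helper should work with a field list such as body, created_utc, score, link_id and parent_id.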
