2 upvotes, 3 direct replies (showing 3)
View submission: Update on COLO switchover -- bug fixes, reindexing and more
Scores are inaccurate in Pushshift because of the way Pushshift works: it pulls something once and then never again.* If you look at scores from the last month, the majority will likely be around 1. Some may be higher if ingest fell behind, but they'll still be wrong.
PRAW is the solution here.
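A minimal sketch of what that could look like, assuming you already have submission IDs (for example from a Pushshift dump) and an authenticated `praw.Reddit` instance. The helper names `to_fullnames` and `refresh_scores` are made up for illustration. The only PRAW call used is `reddit.info(fullnames=...)`, which looks items up by fullname and returns them with their current data.

```python
from typing import Dict, Iterable, List


def to_fullnames(submission_ids: Iterable[str]) -> List[str]:
    """Prefix base-36 submission IDs with 't3_', Reddit's type prefix for submissions."""
    return [sid if sid.startswith("t3_") else f"t3_{sid}" for sid in submission_ids]


def refresh_scores(reddit, submission_ids: Iterable[str]) -> Dict[str, int]:
    """Return {submission_id: current_score} as reported by Reddit right now.

    `reddit` is assumed to be a praw.Reddit instance; only its
    info(fullnames=...) method is used here.
    """
    fullnames = to_fullnames(submission_ids)
    return {item.id: item.score for item in reddit.info(fullnames=fullnames)}


# Usage (needs network access and your own API credentials, shown as placeholders):
#   import praw
#   reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
#   scores = refresh_scores(reddit, ["abc123", "def456"])  # hypothetical IDs
```

Deleted or removed posts may simply be missing from the result, so don't assume every ID comes back.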
Comment by mbtcworld22 at 27/12/2022 at 05:42 UTC
2 upvotes, 1 direct replies
Yes, but another limitation of PRAW is the 1000-item limit. I needed more than 1000 top posts of a subreddit.
Is there currently a way to filter the results by score in PRAW? That would make my project doable, since Pushshift is still unavailable for now.
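As far as I know, Reddit's listing endpoints have no server-side score filter, so with PRAW the filtering has to happen client-side. A rough sketch, where `filter_by_score` is a made-up helper and not part of PRAW. It works on any iterable of objects with a `.score` attribute, such as the submissions yielded by `subreddit.top(...)`.

```python
from typing import Iterable, Iterator, Optional


def filter_by_score(
    submissions: Iterable,
    min_score: Optional[int] = None,
    max_score: Optional[int] = None,
) -> Iterator:
    """Yield only the submissions whose score lies within [min_score, max_score].

    A bound left as None is not applied.
    """
    for submission in submissions:
        score = submission.score
        if min_score is not None and score < min_score:
            continue
        if max_score is not None and score > max_score:
            continue
        yield submission


# Usage (needs network access and credentials; the subreddit name is a placeholder):
#   import praw
#   reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
#   top_posts = reddit.subreddit("learnpython").top(time_filter="all", limit=None)
#   for post in filter_by_score(top_posts, min_score=500):
#       print(post.id, post.score)
```

This only filters what the listing returns, so it does not get around the roughly 1000-item cap.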
Comment by Academic-Rent7800 at 23/12/2022 at 20:58 UTC
1 upvotes, 1 direct replies
Is that the case for the latest Pushshift version too (https://api.pushshift.io/redoc#operation/search_reddit_posts_reddit_search_submission_get)? I was looking at the 'Search Reddit Post' query parameters and thought I could filter by `max_score`.
Comment by Academic-Rent7800 at 23/12/2022 at 21:07 UTC
1 upvotes, 1 direct replies
While going over the Pushshift paper, "The Pushshift Reddit Dataset", I found this:
"In this paper, we present the Pushshift Reddit dataset. Pushshift is a social media data collection, analysis, and archiving platform that since 2015 has collected Reddit data and made it available to researchers. Pushshift’s Reddit dataset is updated in real-time, and includes historical data back to Reddit’s inception."