Comment by safrax on 22/12/2022 at 20:34 UTC

2 upvotes, 3 direct replies (showing 3)

View submission: Update on COLO switchover -- bug fixes, reindexing and more

Scores are inaccurate in Pushshift because of the way it ingests data: it pulls each item once and never refreshes it. If you look at scores from the last month, most will be around 1; some may be higher if ingest had fallen behind, but they'll still be wrong.

PRAW is the solution here.
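
For anyone wanting to see what that looks like, here is a rough sketch rather than a definitive recipe: take the submission IDs you already have from Pushshift and ask Reddit for the current scores through PRAW's `Reddit.info()`. The helper name `refresh_scores` is made up for illustration, and `reddit` is assumed to be an already-authenticated `praw.Reddit` instance.

```python
from typing import Dict, Iterable


def refresh_scores(reddit, submission_ids: Iterable[str]) -> Dict[str, int]:
    """Return {submission_id: current_score} as reported by Reddit itself.

    Assumptions (not from the thread): `reddit` is an authenticated
    praw.Reddit instance, e.g.
        reddit = praw.Reddit(client_id=..., client_secret=..., user_agent=...)
    and `submission_ids` are base36 IDs such as those found in Pushshift data.
    """
    # A submission "fullname" is its ID prefixed with "t3_".
    fullnames = [f"t3_{sid}" for sid in submission_ids]
    # Reddit.info() looks the items up by fullname and yields live objects,
    # so .score reflects the value at request time, not at ingest time.
    return {item.id: item.score for item in reddit.info(fullnames=fullnames)}
```

IDs that Reddit no longer returns (removed items, for example) will simply be missing from the result, so it may be worth comparing the keys against what was requested.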

Replies

Comment by mbtcworld22 at 27/12/2022 at 05:42 UTC

2 upvotes, 1 direct replies

Yes, but another limitation of PRAW is the 1000-item limit. I needed more than 1000 top posts of a subreddit.

Is there currently a way to filter results by score in PRAW? That would make my project doable, since Pushshift is still unavailable for now.
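
As far as I know, Reddit's listing endpoints don't take a score filter, so with PRAW the filtering would have to happen client-side. A minimal sketch under that assumption (the function name `top_posts_with_min_score` is hypothetical, and `subreddit` is assumed to be a `praw.models.Subreddit`, e.g. from `reddit.subreddit("...")`):

```python
from typing import Iterator


def top_posts_with_min_score(subreddit, min_score: int,
                             time_filter: str = "all") -> Iterator:
    """Yield submissions from the subreddit's top listing with score >= min_score.

    Assumption: `subreddit` behaves like a praw.models.Subreddit.
    limit=None asks PRAW for as many items as the listing will return.
    """
    for submission in subreddit.top(time_filter=time_filter, limit=None):
        # The filter runs locally, after the listing has been fetched.
        if submission.score >= min_score:
            yield submission
```

This only narrows down what the listing already returns; it does not get around the 1000-item cap mentioned above.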

Comment by Academic-Rent7800 at 23/12/2022 at 20:58 UTC

1 upvotes, 1 direct replies

Is that the case for the latest Pushshift version too (https://api.pushshift.io/redoc#operation/search_reddit_posts_reddit_search_submission_get[1])? I was looking at the 'Search Reddit Post' query parameters and thought I could filter by `max_score`.

1: https://api.pushshift.io/redoc#operation/search_reddit_posts_reddit_search_submission_get

Comment by Academic-Rent7800 at 23/12/2022 at 21:07 UTC

1 upvotes, 1 direct replies

While going over the Pushshift paper, "The Pushshift Reddit Dataset", I found this:

"In this paper, we present the Pushshift Reddit dataset. Pushshift is a social media data collection, analysis, and archiving platform that since 2015 has collected Reddit data and made it available to researchers. Pushshift’s Reddit dataset is updated in real-time, and includes historical data back to Reddit’s inception."