Update on COLO switchover -- bug fixes, reindexing and more

https://www.reddit.com/r/pushshift/comments/zkggt0/update_on_colo_switchover_bug_fixes_reindexing/

created by Stuck_In_the_Matrix on 13/12/2022 at 00:10 UTC

86 upvotes, 29 top-level comments (showing 25)

There were a few problems with the December mapping (specifically, Reddit Submission ids are now larger than the largest possible int value in the ES mapping). This meant we were missing a lot of December comments over the past day or two.

I have fixed that mapping issue (int -> long) and I am reloading all of December comments. This should be completed in about two hours.
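To see why the old mapping broke, here is a minimal sketch in Python (not Pushshift's actual code). It assumes the ES `int` was a signed 32-bit integer field and decodes base36 ids with the built-in `int(..., 36)`. The function name is made up for illustration; both ids come from this thread.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest value a signed 32-bit integer can hold

def exceeds_int32(base36_id: str) -> bool:
    """Return True if a base36 Reddit id no longer fits in a signed 32-bit integer."""
    return int(base36_id, 36) > INT32_MAX

# y0sstw is an October 2022 submission id quoted in the comments below
print(int("y0sstw", 36), exceeds_int32("y0sstw"))  # 2057193716 False
# zkggt0 is the id of this post (December 2022)
print(int("zkggt0", 36), exceeds_int32("zkggt0"))  # 2150676756 True
```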

Also, I'm going through the fields like subreddit_id, link_id, etc. and making sure they are base36 ids like the old API and not ints. This should be completed tonight as well.

We're going through the bug reports many of you have graciously provided and will be fixing a bunch of them over the next day.

Again, thank you all for your help and patience. The end result from all of this will be a much more robust and stable API with higher rate limits for everyone (probably 2-5 per second based on load). The new hardware can handle a lot more than the older hardware could.

I will keep you all updated but this will probably be my last post for this evening.

Comments

Comment by s_i_m_s at 19/12/2022 at 18:49 UTC*

1 upvotes, 1 direct replies

Going to try and keep track of all the main breaking changes/bugs/notable changes here.

Breaking changes

Metadata/total results

`"total_results": 28462`

The new API now returns a cheaper estimated count of results by default, but in many applications the count is the only part you want.

You will need to add `&track_total_hits=true` to the query to get a real count; otherwise, for large queries, the estimate maxes out at 10000.

Clients will also need to be updated to read the total from a different place, as it now looks like `{"total":{"value":28462,"relation":"eq"}}`
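A rough sketch of adapting a client to both points. It assumes the `total` object sits in a metadata dictionary shaped like the snippet above. The helper names are made up for illustration, and the fallback to `total_results` is only there for old-style responses.

```python
from urllib.parse import urlencode

def build_count_url(base_url: str, params: dict) -> str:
    """Add track_total_hits=true so the API is asked for a real count."""
    query = dict(params, track_total_hits="true")
    return f"{base_url}?{urlencode(query)}"

def extract_total(metadata: dict):
    """Read the result count from either the new or the old metadata layout."""
    total = metadata.get("total")
    if isinstance(total, dict):
        # new layout: {"total": {"value": ..., "relation": ...}}
        return total.get("value")
    # old layout: {"total_results": ...}; returns None if neither is present
    return metadata.get("total_results")

print(build_count_url("https://api.pushshift.io/reddit/search/comment", {"q": "science"}))
print(extract_total({"total": {"value": 28462, "relation": "eq"}}))
print(extract_total({"total_results": 28462}))
```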

~~PMAW uses the field in its pagination process and needs to be updated to use the new field to work properly, among other changes. IIUC there are a couple of pull requests on the GitHub page that bypass the field, but none that adapt it to use the new field yet. PMAW should be updated this week.[1] - 2022-12-19~~ PMAW has been updated for the API changes - 2022-12-24

1: https://www.reddit.com/r/pushshift/comments/zovqr9/pushshift_appears_to_return_0_results/j0swmxg/

--------------------------------------------------------------------------------

`after` and `before` no longer accept YYYY-MM-DD. Support could still be added later, but at least for now it's not there.
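Until then, dates can be converted on the client. A minimal sketch, assuming the parameters take epoch seconds (as in the example URLs elsewhere in this thread) and that midnight UTC is the boundary you want; the function name is hypothetical.

```python
from datetime import datetime, timezone

def date_to_epoch(date_str: str) -> int:
    """Convert a YYYY-MM-DD date to epoch seconds at 00:00:00 UTC."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

print(date_to_epoch("2022-10-10"))  # 1665360000
print(date_to_epoch("2022-10-11"))  # 1665446400
```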

--------------------------------------------------------------------------------

Sort/order

`sort` is now `order` and `sort_type` is now `sort`. Because the old `sort` name has been reused for something else, this is unlikely to be fixed with an alias later.
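A small sketch of translating old-style parameters on the client, assuming the values themselves carry over unchanged under the new names. The function name is made up. It builds a new dict rather than renaming in place, since the old and new meanings of `sort` collide.

```python
def migrate_sort_params(old_params: dict) -> dict:
    """Map the old sort/sort_type parameters to the new order/sort names."""
    # copy everything except the two parameters whose names changed
    new_params = {k: v for k, v in old_params.items() if k not in ("sort", "sort_type")}
    if "sort" in old_params:
        new_params["order"] = old_params["sort"]      # old sort held the direction
    if "sort_type" in old_params:
        new_params["sort"] = old_params["sort_type"]  # old sort_type held the field
    return new_params

print(migrate_sort_params({"q": "science", "sort": "desc", "sort_type": "created_utc"}))
```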

--------------------------------------------------------------------------------

/meta

The meta page no longer exists. The intent was to have a dynamic page where clients like PSAW could get the current rate limit, but SITM never kept it updated.

PSAW requires some modification to work around the changes

https://www.reddit.com/r/pushshift/comments/zlryw1/ive_been_getting_response_status_code_404_since/j0bss25/

Otherwise, PSAW is no longer maintained and its GitHub page recommends using PMAW instead. I was not able to find any active forks.

--------------------------------------------------------------------------------

The `https://api.pushshift.io/reddit/search` comment search endpoint is no longer functional; move to `https://api.pushshift.io/reddit/comment/search` or `https://api.pushshift.io/reddit/search/comment`

It may still be aliased into working again later, but that seems unlikely as the other endpoints are much more intuitive at a glance.

--------------------------------------------------------------------------------

`full_link` is no longer included in submission results; suggest building the URL from `permalink` - 2022-12-26
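Something like the following should do, assuming `permalink` is the usual site-relative path starting with `/r/...`. The function name is hypothetical, and the example path is just this thread's own URL.

```python
REDDIT_BASE = "https://www.reddit.com"

def build_full_link(permalink: str) -> str:
    """Rebuild an absolute submission URL from the relative permalink field."""
    if permalink.startswith("/"):
        return REDDIT_BASE + permalink
    # already absolute (or an unexpected shape): return it untouched
    return permalink

print(build_full_link("/r/pushshift/comments/zkggt0/update_on_colo_switchover_bug_fixes_reindexing/"))
```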

--------------------------------------------------------------------------------

It is no longer possible to sort submissions by `num_comments`. Considering we're supposed to be getting aggs back once all of this is working again, I think this is just an oversight on SITM's part rather than an intentional change, but with so much else broken I'm not going to ask about it until I start seeing some of this being fixed. - 2022-12-31

--------------------------------------------------------------------------------

Searching by `url` doesn't work. It is not listed in any current documentation I can find, so it may no longer be supported, or it could just be something that got left out by accident. Will check after things start getting fixed. -- 2023-01-19

--------------------------------------------------------------------------------

Bugs

`size` is supposed to be aliased to `limit` but doesn't work the same:

`size=0` returns 10 results

`limit=0` returns 0

--------------------------------------------------------------------------------

author search has problems with dashes.

author search is now a contains match rather than an exact match.

--------------------------------------------------------------------------------

subreddit search has similar problems to author search and appears to be matching on contains rather than exact match. As an example, https://api.pushshift.io/reddit/search/submission?subreddit=science&author=science is returning results from user self-post subreddits like u/Inner-Science-5658 - 2023-02-01
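As a stopgap, the loose matches can be filtered out on the client. A minimal sketch, assuming each result is a dict with `author` and `subreddit` keys; the helper name and the two sample records are made up for illustration, and the comparison ignores case on the assumption that Reddit names are case-insensitive.

```python
def exact_matches(results: list, field: str, wanted: str) -> list:
    """Keep only results whose field equals the wanted value, ignoring case."""
    wanted = wanted.lower()
    return [r for r in results if str(r.get(field, "")).lower() == wanted]

# made-up records standing in for an API response
sample = [
    {"subreddit": "science"},
    {"subreddit": "u_Inner-Science-5658"},
]
print(exact_matches(sample, "subreddit", "science"))  # [{'subreddit': 'science'}]
```

This does nothing about the wasted requests, it only cleans up what comes back.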

--------------------------------------------------------------------------------

~~Submission search currently only goes back about 45 days; the data isn't there. It's supposed to be loaded from the old API this week. - 2022-12-19. Submissions are slowly being reloaded from the beginning; currently there is a gap from 2022-01-09 to 2022-11-03. Minibug made a page to track the progress here[4]. - 2023-03-29~~

Back submissions reloading appears to be complete as of 2023-04-06

4: https://minibug1021.github.io/pushshift.html

--------------------------------------------------------------------------------

`fields` is now `filter`, although this is supposed to be aliased later so that either works.

--------------------------------------------------------------------------------

redditsearch.io is now broken entirely: it still loads, but the search function doesn't work. The comment search had already been broken for a while, and now the submission search doesn't work either.

Suggest using one of the other maintained front ends, like:

https://camas.unddit.com/

~~https://redditsearchtool.com/~~ broken by an API change resulting in a redirect - 2023-01-05

https://adhesivecheese.github.io/chearch/

--------------------------------------------------------------------------------

`!` negation no longer works, suggest using `-` instead~~, not sure if intended change or bug~~. Neither works on author or subreddit searches, ~~seems like a bug.~~ --confirmed bug 2022-12-21.

--------------------------------------------------------------------------------

querying `link_id` is only working in base 10 format[8] instead of the normal base 36 - 2023-01-07

8: https://www.reddit.com/r/pushshift/comments/103k1qe/anyone_have_luck_using_the_link_id_param_in_the/j2zyjkp/

--------------------------------------------------------------------------------

The API is returning `parent_id` values for comments in base 10 instead of base 36 -- 2023-01-12
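Until that is fixed, the numeric value can be turned back into base 36 on the client. A minimal sketch with a hypothetical helper name; it only covers the numeric part, so any `t1_`/`t3_` type prefix would have to be added separately. The example value is the submission id quoted elsewhere in this thread.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(number: int) -> str:
    """Encode a non-negative integer as a lowercase base36 string."""
    if number < 0:
        raise ValueError("ids are expected to be non-negative")
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    # digits were collected least-significant first
    return "".join(reversed(digits))

print(to_base36(2057193716))  # y0sstw
```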

--------------------------------------------------------------------------------

Notable changes

The `metadata=true` flag seems to be ignored now and is always enabled regardless of setting.

--------------------------------------------------------------------------------

`until` is the new name for `before` and `since` is the new name for `after`, but both the old and new names seem to be functional.

New API documentation.

https://api.pushshift.io/redoc

and

https://api.pushshift.io/docs

If it's not here, I've missed it; please let me know. I aim for this to be a comprehensive list.

Comment by pacman_sl at 14/12/2022 at 19:26 UTC

13 upvotes, 2 direct replies

It seems to me that there are some breaking changes to the API and I'm surprised to see them unannounced:

Comment by GreatBlitz at 23/12/2022 at 11:11 UTC

12 upvotes, 0 direct replies

Any idea when we might get access to posts older than a month?

Comment by Postpone-Grant at 14/12/2022 at 20:42 UTC

9 upvotes, 1 direct replies

Bug report:

Using the `author` parameter on the `/reddit/search/submission` endpoint does not perform an exact-match search. It seems to perform a LIKE search.

For instance, searching using my username Postpone-Grant will return submissions for users with similar usernames, such as Grant-James_River282 or Grant-McDonald.

Instead, that endpoint should only return submissions for the exact provided author.

Thanks!

Comment by sexyrexy2185 at 13/12/2022 at 02:05 UTC

8 upvotes, 0 direct replies

You fucking rock man. Good Luck with the switchover.

Comment by jmcgomes at 13/12/2022 at 18:14 UTC

8 upvotes, 0 direct replies

I seem to have the same issue with submissions older than Nov 3. I can find comments though.

Is this some bug that will be fixed? Or some permanent data loss?

Ex (Oct 10 to Oct 11):

https://api.pushshift.io/reddit/search/submission?subreddit=askreddit&before=1665446400&after=1665360000 --> Returns nothing

https://api.pushshift.io/reddit/search/comment?subreddit=askreddit&before=1665446400&after=1665360000 --> Returns plenty

If I take one submission ID from those comments returned, ex: y0sstw, and try to get it directly:

https://api.pushshift.io/reddit/search/submission?ids=2057193716 --> Doesn't find it

Also note that I had to manually convert the base36 link_id to int. Passing the base36 ID results in Internal Server Error. I assume this is also a bug.

Comment by ExcitingishUsername at 15/12/2022 at 06:06 UTC*

7 upvotes, 0 direct replies

Some significant bugs seem to have been introduced during the migration; most notably, it no longer appears to be possible to exclude multiple authors (and, as another commenter pointed out, the author names themselves are not being properly matched either[1]). Both of these completely break our analytics in a way that doesn't seem practical to work around (we'd need to retrieve hundreds of extra pages in some instances). For example, `author=!AutoModerator,!SomeOtherBot` would previously exclude both those accounts, but now it doesn't exclude either of them. If I'm reading the metadata correctly, this is because it's matching "any" of these conditions, which of course doesn't make sense when trying to exclude things.

1: /r/pushshift/comments/zkggt0/update_on_colo_switchover_bug_fixes_reindexing/j08gfqt/
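For reference, the client-side fallback would look roughly like the sketch below. It assumes each result is a dict with an `author` key; the helper name and the sample records are made up. It does not avoid the extra pages mentioned above, since the excluded accounts' items are still downloaded before being dropped.

```python
def drop_authors(results: list, excluded: list) -> list:
    """Remove results written by any of the excluded authors, ignoring case."""
    blocked = {name.lower() for name in excluded}
    return [r for r in results if str(r.get("author", "")).lower() not in blocked]

# made-up records standing in for one page of API results
sample = [{"author": "AutoModerator"}, {"author": "SomeOtherBot"}, {"author": "a_human"}]
print(drop_authors(sample, ["AutoModerator", "SomeOtherBot"]))  # [{'author': 'a_human'}]
```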

Additionally, are the `unique`, `before_id`/`after_id`, and `distinguished` parameters functional, and are there examples of how these are supposed to be used? They have never worked for me at all, even before the migration, though it is possible I am just using them wrong (or even that the documentation is wrong or unclear).

Finally, is `metadata=false` not the correct way to turn off metadata? It seems to be on by default now, and it seems wasteful to be returning this in cases we aren't going to be using it.

Comment by Furrystonetoss at 13/12/2022 at 15:18 UTC

6 upvotes, 0 direct replies

The API is still down; it keeps returning zero results. A shame, as I wanted to use this API to get content of a banned user and his removed post, a few days ago.

Comment by Agitated-Bee4055 at 13/12/2022 at 17:23 UTC

6 upvotes, 0 direct replies

Only posts after 11-3; any older than that??

Comment by That1BlackGuy at 31/01/2023 at 04:59 UTC

4 upvotes, 0 direct replies

Any idea when we might see pre-Nov 3 data loaded?

Comment by gurnec at 13/12/2022 at 02:06 UTC*

3 upvotes, 1 direct replies

Great news!

Quick question, what is the preferred avenue for future bug ~~requests~~ reports? (sorry that was a weird typo)

Comment by n-e-i-b at 14/12/2022 at 22:17 UTC*

3 upvotes, 2 direct replies

Hi

"total_results" is no longer returned in metadata.

There is a "total" field but it's limited to the default ElasticSearch value : 10 000

Edit: I tried adding "&track_total_hits=true" to the URL. It seems to work better, but returns a lot fewer results than before. Maybe the reindexing is still in progress.

Comment by sc00p at 17/12/2022 at 13:48 UTC*

3 upvotes, 1 direct replies

There hasn't been any new data for the last 4 days... Should I change something to my current extractor?

Edit:

I found out that this might be because of two reasons:

Edit: After removing the filter parameter and changing the before/after, I cannot get this working. PRAW returns 'max entries exceeded'. Will continue troubleshooting later.

Comment by No_One_3701 at 18/12/2022 at 20:13 UTC

3 upvotes, 2 direct replies

I still cannot scrape anything older than November 3, 2022. Does anyone have an idea why?

Comment by mbtcworld22 at 22/12/2022 at 03:27 UTC

3 upvotes, 1 direct replies

Are the results still just one month old? When can we start getting the old data?

Comment by YoureNotSpeshul at 13/12/2022 at 02:27 UTC

2 upvotes, 0 direct replies

Great job! Seriously, this is tremendous.

Comment by Agitated-Bee4055 at 14/12/2022 at 06:00 UTC

2 upvotes, 1 direct replies

submissions "**selftext**" return "**[removed]**" but the post is **good** and not removed

not all submissions*

Comment by [deleted] at 14/12/2022 at 08:42 UTC

2 upvotes, 1 direct replies

[removed]

Comment by i_Killed_Reddit at 15/12/2022 at 09:57 UTC

2 upvotes, 0 direct replies

Can wait a little longer, if this is going to be an upgrade. Awesome work man.

Comment by kjjejones42 at 31/12/2022 at 11:05 UTC

2 upvotes, 1 direct replies

Is it still possible to sort Pushshift submission search results by the number of comments? The "num_comments" option isn't listed at api.pushshift.io/redoc[1]. Is this a bug or has the functionality been removed?

1: https://api.pushshift.io/redoc

Comment by Only_Ad_1230 at 02/01/2023 at 18:38 UTC

2 upvotes, 3 direct replies

Hi, based on the comments below, I see the API seems to be working for some of you. But I consistently see 'Timeout' errors when I try to use the API to get any data, either through the web or through PSAW or PMAW.

Both of the requests below seem to fail. Can you please let me know if I am missing something here?

https://api.pushshift.io/reddit/comment/search/?q=science

https://api.pushshift.io/reddit/search/comment?q=science

Comment by Familiar-Temporary67 at 06/02/2023 at 21:06 UTC

2 upvotes, 0 direct replies

Any update? I cannot use the API to get data. It returns 0.

Comment by Undescended_tester at 01/05/2023 at 13:15 UTC

2 upvotes, 0 direct replies

Any news? It's been 5 months of pretty much radio silence...

Comment by Hynauts at 15/12/2022 at 16:40 UTC*

1 upvotes, 1 direct replies

2a0a915d9e42fa32768d7772c2fd3814ce1b5857492e0630ddbd82af8231e2fb

Comment by [deleted] at 19/12/2022 at 00:53 UTC

1 upvotes, 1 direct replies

[deleted]