💾 Archived View for dioskouroi.xyz › thread › 29395096 captured on 2021-11-30 at 20:18:30. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
________________________________________________________________________________
We’ve just moved to Snowflake. I haven’t really been impressed with some of the new features added to Redshift; they seem like too little, too late.
The JSON support (SUPER type) is kind of cool, and they are moving towards more “automatic” sorting + partitioning, but it’s just all a bit shit to be honest.
We encountered major bugs with data sharing; our clusters keep insisting that zstd is the best compression format for all our data (but then never actually use it); materialised views often fail to update, and understanding why is a nightmare; terrible performance if your strings are varchar(max) (guess what Glue sets them to…); Redshift often just dies (4-hour downtime recently, no status page) and has some really weird semantics around listing queries; before the Data API you couldn’t run async queries, and its EventBridge integration straight up doesn’t work; nightmare bugs in the Java connection library that don’t show up using psql; a tiny set of types (no arrays, no uuids); unkillable queries; AQUA actually causing everything to slow down hugely; critical release notes posted only in a fucking random forum; etc. etc.
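To make the varchar(max) point concrete, here's a sketch (hypothetical table and column names). As I understand it, Redshift sizes buffers for intermediate result rows from the declared column width, so the VARCHAR(65535) that Glue crawlers default to for string columns bloats every join and aggregation:

```sql
-- Hypothetical schema as a Glue crawler would type it: every string
-- column becomes VARCHAR(65535), i.e. varchar(max).
CREATE TABLE events_from_glue (
    event_id VARCHAR(65535),
    payload  VARCHAR(65535)
);

-- Declaring realistic widths keeps intermediate rows small, since
-- Redshift allocates temp-table memory from the declared width:
CREATE TABLE events_sized (
    event_id VARCHAR(36),    -- e.g. a UUID rendered as text
    payload  VARCHAR(4096)
);
```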
Snowflake has apparently sorted this, as well as including ingestion tools (snowpipe) that you’d otherwise have to stitch together with AWS Glue or something (a cursed service if ever there was one).
That being said, in some cases Redshift absolutely flies. But the real world isn’t filled with ideal schemas and natural sort keys. It’s messy. And Snowflake deals with messy better.
I've had terrible experiences with Snowflake "automatic" optimizations. You have zero visibility or control over partitioning, join strategy, or anything else compared to Spark/Databricks.
> You have zero visibility or control over partitioning
Snowflake gives you visibility into clustering [1], and in the query profile view you can see how pruning is working (or not working).
Can you give an example of what visibility you would like to see in terms of partitioning?
1.
https://docs.snowflake.com/en/sql-reference/functions/system...
Being able to see exactly which columns are partitioned on so I know whether my query is going to be efficient or require full scan+materialization. I understand from your link Snowflake has clustering keys but it seems like this only "suggests" that Snowflake should partition on that key. Also with Spark I can look at the query plan while it's executing and figure out what part of my query is bottlenecking. Seems like Snowflake requires a query to complete for me to do that.
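For reference, the clustering visibility cited in [1] looks roughly like this (a sketch; ORDERS and O_ORDERDATE are hypothetical names):

```sql
-- Ask Snowflake how well micro-partitions line up with a candidate key.
-- Returns a JSON blob with average_depth / average_overlaps; high values
-- mean overlapping partitions, poor pruning, and near-full scans.
SELECT SYSTEM$CLUSTERING_INFORMATION('ORDERS', '(O_ORDERDATE)');

-- And yes, a clustering key is only a hint to the background
-- reclustering service, not a hard partitioning guarantee:
ALTER TABLE ORDERS CLUSTER BY (O_ORDERDATE);
```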
What do you mean "too late"? Redshift was released in 2012.
Edit for future readers: the original comment was "I haven’t really been impressed with Redshift, it seems like too little and too late".
Sorry, I tend to abbreviate too much. Given the context of the article (new sexy cool redshift features ripped from Snowflake) I meant to convey “the direction redshift is heading in seems too little too late”.
Oh ok. I don't know Snowflake so I had no idea this new feature was a (late) response to Snowflake.
This article was released 30 NOV 2021. So 9 years later.
Given Snowflake also uses S3, could AWS do something with Redshift that moves computation closer to S3 and be even faster than Snowflake? Basically, take advantage of the fact that they own S3.
they already have
https://aws.amazon.com/redshift/features/aqua/
basically distributing compute down to the actual storage nodes
Yeah that’s cool except it made all our queries slower. I’ll add that to my list above, forgot about that.
I wish Amazon would stop naming things "serverless" that clearly have a well-defined type and number of servers at any point. That includes Redshift Serverless and Aurora Serverless. If it has a cluster, it isn't serverless, it is just autoscaling. Every time they announce a serverless product I assume that it will be like Lambda and am mostly disappointed. For example, a real Aurora Serverless would be more like CockroachDB Cloud or DynamoDB. And a real Redshift Serverless would be more like BigQuery.
I have worked with several companies that have their infrastructure on AWS but consider BigQuery or Snowflake for the serverless model they provide. This brings Redshift much closer to those options. I envision Redshift Serverless becoming the default option for most enterprises, mainly because it stays in the AWS ecosystem and you don't have to work with a different vendor and create different cost governance processes.
I believe the real advantage AWS has here is in cost. Snowflake has positioned itself as price competitive with Redshift, but this is primarily due to Snowflake's ability to scale on demand, whereas prior Redshift versions required you to size for peak usage (RA3 helped with this). In my experience Snowflake is an order of magnitude more expensive if you compare similar workloads and do not account for idle time. We will need to see the performance of a "Redshift Processing Unit" to be sure of the advantage, but even so AWS will be able to provide significant downward cost pressure through this offering.
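A back-of-envelope version of the idle-time argument (hypothetical hourly rates, not real pricing for either product):

```sql
-- An always-on cluster bills 720 hours a month. An on-demand warehouse
-- at three times the hourly rate, but billed only for 2 busy hours a
-- day, still comes out far cheaper; the comparison flips once the
-- workload keeps the on-demand warehouse running all day.
SELECT 24 * 30 * 5.00  AS always_on_usd_per_month,   -- 720 h * $5/h  = $3600
       2  * 30 * 15.00 AS on_demand_usd_per_month;   -- 60 h  * $15/h = $900
```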
> _In my experience Snowflake is an order of magnitude more expensive if you compare similiar workloads and do not account for idle time._
Cost is why I'm most bullish about Databricks's FOSS
I was initially excited about this, as I think it might solve our Redshift pain points and potentially avoid us having to deal with a migration to Snowflake. But then I remembered that AWS account managers promised Athena and Spectrum would solve these same problems at a previous company I worked for a few years ago. I'm assuming the developer experience will still be terrible, with lots of knobs to tune to actually get any decent cost/performance.
It's not very clear to me why I should keep using normal Redshift instead of switching to Serverless. Does anyone have more insight?
If you have a very stable SQL pipeline that runs frequently during the day then it's probably cheaper to have it on a normal Redshift cluster.
How is this different than Athena?
Athena is not a relational database under the hood. Very different technologies with different implications.
My impression is
Athena = Lambda + S3 (what i would call true serverless)
Redshift Serverless = Auto AWS Managed EC2 instances with local storage + S3
Although I could be wrong as I just had a quick 5 minute look at it...
Querying things other than data in S3 seems to be the big difference.
Athena can query non-S3 data sources using federated query.
But that's not to say it's the best idea. Athena is mostly for S3; AWS say so.
From their FAQ [1]:
Q: What is the difference between Amazon Athena, Amazon EMR, and Amazon Redshift? [...] Amazon Redshift provides the fastest query performance for enterprise reporting and business intelligence workloads, particularly those involving extremely complex SQL with multiple joins and sub-queries. [...] Amazon Athena provides the easiest way to run ad-hoc queries for data in S3 without the need to setup or manage any servers.
Elsewhere they also recommend Athena for unstructured data. This and a relational database like Postgres are fundamentally different and you shouldn't use one when your use case is primarily meant for the other.
Two competing departments inside AWS.
Technically, Athena is based on modified Presto while Redshift is (very) heavily modified Postgres.
Competing departments and competing underlying technologies. Postgres (and therefore Redshift) is running a relational database engine under the hood.
Yes, though with RA3 storage being on S3, I think they are getting more and more similar.
So basically BigQuery from AWS. Looks good at first sight, a bit late. I personally worked for a large org which has just moved to BigQuery from Redshift, and I have to say that BigQuery is the much better product.
The only problem with BigQuery is that it's on GCP, so either you have to migrate your whole workload over or you end up eating a lot of data transfer costs.
Having used both, I do think BigQuery is better in a lot of ways (although it's easier to make it a lot more expensive too), but I'm really excited to see Redshift catch up. Adding the serverless option is really great too, since my biggest complaint with Redshift was managing the quantity and type of the underlying instances.
Athena is BigQuery from AWS. I'd have to imagine this is for people already heavily invested in Redshift.
Athena is not BigQuery. Athena is just a fancy wrapper around Hive (you can see this right from the log output) and just runs map reduce over your S3 data. It's a great tool for what it is.
BigQuery is a full database. It is _significantly_ faster than running anything from Athena. The closest comparison on AWS is Redshift.
I think Athena has a lot of value (especially given its pricing), but you're not wrong that it has limitations.
Getting data into Athena isn't something that is just done for you. Athena just takes what you've put on S3 and queries over it - and leaves getting it onto S3 (and into an efficient format) as an exercise for the reader.
Athena's speed varies a lot depending on what format you put things in. Querying over CSVs will mean that you're slow and reading a lot of data. Querying over ORC (column-store) files is pretty quick.
The big thing is Athena's pricing. They price it on how much data you read _in reality_ not how much data would be read if things weren't optimized. BigQuery charges you based on how much data would be read if it weren't optimized. With BigQuery, an integer is always 8 bytes. It doesn't matter if they're able to optimize it down to nothing using RLE (run length encoding). You still pay the full 8 bytes. If your ORC files make that integer column tiny, you get the benefit of that.
BigQuery is great, but Athena's pricing is a lot cheaper given that you get to benefit from any storage optimization you do.
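A rough sketch of that billing difference (illustrative numbers; the Athena figure assumes the column compresses very well under RLE):

```sql
-- A 100M-row integer column. BigQuery on-demand bills logical bytes
-- (8 per INT64) regardless of how the column is actually stored;
-- Athena bills the physical bytes read from S3, so a well-compressed
-- ORC column costs a fraction of that.
SELECT 100000000 * 8 / 1000000000.0 AS bigquery_gb_billed,  -- always 0.8 GB
       0.05                         AS athena_gb_billed;    -- hypothetical: ~50 MB after RLE
```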
Out of curiosity, how have you used Athena that you're seeing it be so much slower? In my experience, BigQuery is faster (maybe 2x faster), but I've been using column-oriented data with Athena. If you're using CSVs with Athena, it will be way slower than BigQuery.
I'm always a little surprised that AWS doesn't build Athena out more, but I guess if they did they'd want money and margin for the value add. Still, Athena is a pretty decent serverless Presto and Presto can work pretty well over data in column formats.
Athena is more of a wrapper over Presto / Trino than Hive. It does not run map reduce per se over your data; it's a SQL engine.
AWS Glue Data catalog is Hive, and as you mentioned, Athena is a wrapper of Presto
It isn't. It is more similar to Snowflake's warehouse model.
Yeah, I get that they want to go after Snowflake, Databricks and BigQuery, but AWS is not known for delivering high-quality software. They have a few things that are very good (EC2, S3, Lambda) and the rest is done by their B-team and barely holding together.
Keep telling yourself that
Just load your data
Load my data where? This is "serverless".
To "the cloud" :P
More seriously, "serverless" usually just means you aren't supposed to worry about server/cluster management, not that there are no servers anywhere. So it really means "load your data to Redshift, wherever that lives".
I look forward to the next stage of cloud hype - dataless - so I don't have to worry about the nuances of data management and I can just run my analytics.
I've long dreamed of that. Everything becomes stateless and everything to do with GDPR/PCI/Disaster recovery/anonymisation/performance/scaling/etc/etc becomes so much easier.