TimescaleDB 2.0 released, now distributed multi-node

Author: manigandham

Score: 209

Comments: 49

Date: 2020-10-29 16:08:22

Web Link

________________________________________________________________________________

eloff wrote at 2020-10-29 17:39:55:

Wow, this is a big deal. A free, distributed timescale database built on top of trusty Postgres, with column oriented, compressed storage for the time series data.

Also as I've stated before on HN, I love the new license which gives you almost all the freedoms of open source software while still letting TimescaleDB create a profitable business and not get squashed by AWS - which benefits nobody.

I have lot's of questions:

Queries on distributed hypertables are run across the cluster, but what about regular queries to regular tables? Write transactions obviously need to happen in one server, but is there a way to split read queries across the cluster? I don't know if regular table data is duplicated across data nodes, or if it's only on the access node and joined transparently to the result set from the distributed hypertable query.

There are 3 kinds of server types for the cloud offering on AWS, IO, compute, or storage optimized. These seem to imply the database storage goes on instance storage, not EBS. This performs much better than EBS, especially on the high IO NVMe instances AWS offers. It also offers more predictable performance and tail latency because there is no network involved, no distributed block storage system (and EBS has had more than it's fair share of outages.) Is this true? I know compose.io (IBM) does this for their managed PostgreSQL for similar reasons.

I'd like to compare the pricing to RDS for same-sized instances + storage, with moderate workload so I can get a feel for the markup on the managed service and know at at a glance, ok, this is going to cost me 50% more than RDS but I also get all these TimescaleDB features.

I'd also like to see some performance comparisons, just for regular read + write workloads on Postgres, because if I'm right about using instance storage vs EBS the same sized hardware will give more transactions per second for TimescaleDB.

mfreed wrote at 2020-10-29 17:55:12:

Thanks for all the questions @eloff! I might try to separate responses for hopefully some threading :)

The software in our architecture is actually unified: You spin up K instances of TimescaleDB, log into one of them and then call "add_data_node" to attach others and data nodes, and "create_distributed_hypertable" to make this node the access node _for that particular hypertable_.

In fact, you can explicitly specify different data nodes as the "backend" for different distributed hypertables, e.g., same access node, but data nodes {A, B} for distributed hypertable 1; data nodes {C, D, E, F} are for distributed hypertable 2. You could also have different nodes play the role of access nodes for different hypertables and vice-versa. Lots of flexibility here, although we imagine most users won't need it, so by default your distributed hypertable is created across all connected data nodes.

Now, because these nodes are just running TimescaleDB, if you were to create a regular table or even a regular (non-distributed) hypertable on that access node, it would be just stored locally. Then, any JOIN between the distributed hypertable data and regular table happens transparently on the access node.

Now, that's what in the current release, but not surprisingly, we plan to add support for better local JOINs. For regular tables, one approach will likely be to provide an option to replicate the table on all data nodes, so that you can perform the JOINs locally (and because TimescaleDB 2.0 already supports 2PC for individual rows across data nodes to support replication factors > 1, that machinery somewhat already exists). A second approach for regular tables that are also partitioned -- particularly if using the same "space" partitioning key as your distributed hypertable -- is to collocate partitions of the regular table across the data nodes, so that JOINs on the space partition keys (e.g., server_id, user_id, etc.) can again be local.

eloff wrote at 2020-10-29 18:24:45:

Ok, I think I understand. Each node is just Postgres with TimescaleDB, and acts as a separate Postgres server with it's own replication (if any) - at least at present.

So your "fact" regular tables are created on the server you configure as the access node and then you can join with distributed hypertables in your queries.

If one wanted the same fact tables on multiple access nodes you could either use postgres partitioning or duplicate them somehow yourself.

I like the flexibility here in cluster configuration. It's good to have options when scaling a system.

mfreed wrote at 2020-10-29 18:28:07:

Yes, with the observation is that this is the base flexibility, while we will be adding much more transparent stuff in the future. So in 2.0, if you want a chunk of a distributed hypertable replicated, you don't do it manually, but just configure the distributed hypertable to use "replication_factor = 2", etc.

Native replication (and the approaches I described for regular "fact" tables as well) are major focuses of the 2.x roadmap.

mfreed wrote at 2020-10-29 18:03:34:

Regarding offerings on Timescale Cloud around "IO-optimized", "compute-optimized", and "storage-optimized", and particularly that Timescale Cloud runs on all three major clouds (AWS, Azure, GCP) across 76 regions:

This indeed reflects that these different options are using different VM instance types, as well as different storage options, including both local instance storage _and_ remote block storage (EBS).

You can partially see this also in the maximum storage capacity of options, where some service types can do 10TB of storage, while others max out at 500GB. As well as in variety in which options we can offer in which clouds/regions based on the underlying hardware capacities that are available.

(Note this is the maximum sized "disk". Because TimescaleDB's compression on average gets 94-97% storage reductions -- measured across the field -- the amount of raw data you can actually store on these options is much higher. And why also it's not an apples-to-apples price comparison against RDS!)

So it's exposing this tradeoff in performance/price/capacity to users to hopefully find these exact needs. And the good thing is there is a "one-click" fully-online migration between instance types (and regions/clouds), so switching is always quite easy, especially as your usage grows.

eloff wrote at 2020-10-29 18:31:09:

That's really flexible. I think you're missing a chance to show TimescaleDB mopping the floor with RDS on throughput and tail latency on the high IO instances on a cost per transaction basis - for regular Postgres tables and workloads - time series features aside. There's a good blog post waiting there.

I didn't know about the one-click online migration, even between regions and clouds, that's huge! It's not a feature RDS will ever have for obvious reasons.

akulkarni wrote at 2020-10-29 17:26:42:

Hi all, thanks for the reading the post. We're very excited to finally launch TimescaleDB 2.0.

In fact, users have already been running multi-node TimescaleDB in continuous daily use for many months, including a 22-node cluster by a Fortune 100 tech company ingesting more than a billion rows per day. So this product is already quite robust :-D

gabereiser wrote at 2020-10-29 17:37:10:

This is so good, the features, the licensing, all of it. I'm not going to argue the use-case or NewSQL vs NoSQL or anything but for someone who wants PostgreSQL-level knowledge transfer working with time series data (audit logs, event history, change management, etc), this is amazing! Good work!

akulkarni wrote at 2020-10-29 19:31:21:

Thank you! Great set of use cases as well :-)

pachico wrote at 2020-10-29 17:52:45:

From what I've seen, performance is still much worse than Clickhouse, that was always distributed, open source, data warehouse like and feature rich.

Why should I use timescale?

I'm really asking, I'm not being rhetorical.

hardwaresofton wrote at 2020-10-29 18:01:04:

Clickhouse and Timescale are different types of databases -- Clickhouse is a columnar store and Timescale is a row-oriented store that is specialized for time series data with some benefits of columnar stores[0].

Something like InfluxDB is a better thing to compare to TimescaleDB (and TimescaleDB does very well, though the benchmark was a bit old[1] and influx might have improved in the meantime).

Database types aside, what really gets me excited about Timescale is that it's just another Postgres extension. If you're already running a Postgres cluster for your OLTP workloads (web-app-y workloads) and have just a bit of fast-moving time series data (ex. logs, audit logs, event streams, etc), Timescale is only an extension away. You get the usual time-tested battle hardened Postgres, with all it's features _and also_ support for your time series workloads. Yeah you could set up declarative partitioning yourself (it is a postgres feature after all) but why bother when Timescale has done the heavy lifting?

[EDIT] - see the response below -- the benchmark is up to date, and Timescale does even better against the purpose-built tool that is InfluxDB.

> Note: This study was originally published in August 2018, updated in June 2019 and last updated on 3 August 2020.

[0]:

https://blog.timescale.com/blog/building-columnar-compressio...

[1]:

https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...

mfreed wrote at 2020-10-29 18:10:13:

Thanks for thoughts here.

One thing we continually hear is that the familiarity, trust, ecosystem, maturity, etc. that folks love about Postgres is a huge boon and one deciding factor for their adoption of TimescaleDB.

Just to clarify: those benchmarks about InfluxDB vs. TimescaleDB were fully redone just 2 months ago (August 2020) with the latest versions of both, so should be quite up-to-date. In fact, since the ~year since our last benchmarking, TimescaleDB's performance _relative to InfluxDB_ only significantly increased.

"Version: TimescaleDB version 1.7.1, community edition, with PostgreSQL 12, InfluxDB version 1.8.0 Open Source Edition (the latest non-beta releases for both databases at the time of publishing)."

https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...

hardwaresofton wrote at 2020-10-29 18:22:27:

> One thing we continually hear is that the familiarity, trust, ecosystem, maturity, etc. that folks love about Postgres is a huge boon and one deciding factor for their adoption of TimescaleDB.

Yeah, I think this is huge, I also personally like that you get the extensibility of Postgres as well. Right now custom table access methods are still "cooking" but I think being able to combine a true postgres-native columnar access method with Timescale's benefits would be a game changer. There's also zheap which is still being worked on but if/when it lands postgres will be even better at OLTP workloads and possibly an even better base for Timescale to stand on.

> Just to clarify: those benchmarks about InfluxDB vs. TimescaleDB were fully redone just 2 months ago (August 2020) with the latest versions of both, so should be quite up-to-date. In fact, since the ~year since our last benchmarking, TimescaleDB's performance _relative to InfluxDB_ only significantly increased.

That's fantastic to hear -- I am already sold on Timescale since to get close to a purpose-built tool with a solution built on top of a more general platform is already very impressive, but I will be re-reading the article closely to get more details on the exact trade-offs.

pachico wrote at 2020-10-29 18:21:10:

I mean using Clickhouse for time-series, of course.

I understand your point on adding a new feature to your already existing Postgres solution.

It's kinda what I do by using MySQL engine and dictionaries with Clickhouse, I assume.

hardwaresofton wrote at 2020-10-29 18:28:43:

Yes -- but slightly different, but _without_ the network hop!

MySQL engine for Clickhouse sounds like dblink[0] or foreign data wrappers(fdw)[1] in Postgres. Doing it with Postgres allows for way more flexibility (the data could be local or remote) in this case, and the data will be _at home_ in Postgres, with all the stability, features, operational knowledge (and also bugs/warts of course) that come with Postgres.

You may never get 100% of the performance you'd get from a purpose-built database that doesn't make the choices Postgres makes but the idea of getting 80/90% of the way there, with only one thing to maintain is very exciting to me.

[0]:

https://www.postgresql.org/docs/current/contrib-dblink-funct...

[1]:

https://www.postgresql.org/docs/current/postgres-fdw.html

avthar wrote at 2020-10-29 20:03:58:

Ps the benchmarks cited in [1] for TimescaleDB vs InfluxDB were actually updated in August 2020, so the numbers are fairly recent :)

manigandham wrote at 2020-10-30 02:19:18:

Native column-oriented data warehouses designed for OLAP queries will always be faster. There are multiple alternatives from Clickhouse to Redshift that will be faster.

Originally I didn't like Timescale because it didn't offer anything new but the product has improved greatly over the years. Today it's close on performance by using a custom column-oriented data layer that stores the actual chunks in PostgreSQL rows and has several time-related processing and analytical features (continuous aggregates, time bucketing, smoothing values, etc) that make it easier than doing it yourself in raw SQL.

One of the big advantages is that it allows you to use Postgres which means you can continue to use it as your main OLTP operational database as well. This avoids a lot of complicated polyglot issues like syncing datasets or using different querying systems with different syntax. It's one of the better examples of using Postgres as a data platform rather than a simple database.

There are other alternatives that combine this OLAP+OLTP functionality like Citus (another automatic sharding distributed database extension for Postgres), Vitess (automatic sharded mysql), TiDB (natively distributed mysql interface on top of key/value store), MemSQL (proprietary distributed mysql interface with ram-based rowstores and disk-based columnstores) and SQL Server (with hekaton column-stores, in-memory tables, and scale out).

szemet wrote at 2020-10-29 18:04:05:

Based on more mature codebase.

/When I tried Clickhouse, I managed to segfault it with NULL pointer dereference error. Ok it is anecdotal, and maybe I just had bad luck, same could happen with Postgres etc etc... But anyway: it can be a deciding factor.

(And they fixed it quickly - issue 7955 on github) /

parhamn wrote at 2020-10-29 17:26:23:

But a lot of NoSQL adoption was a knee-jerk reaction, along the lines of, “relational databases don’t scale, so I need a NoSQL database.”

I don't agree with this. If you view relational-ness as a spectrum its not so simple. I think part of this knee-jerk reaction was because we over advised things like "denomralize everything" which lead to these monstrous schema diagrams that effected everything in the application pipeline: query performance, permissions, caching, transaction isolation, realtime capabilities, understandability, schema management nightmares, and more.

In fact the world proved through NoSQL that a lot of what hyper-relational models we're providing us weren't required to run a business. And you could succeed and do things quickly when taking an extreme stance against relations.

I think the average app should probably revert to something in the middle. I've been building applications recently with only acyclical hierarchical relationships. Forcing that constraint up-front does tie your hands a bit but it makes EVERYTHING a ton easier once you design your models around it.

mkindahl wrote at 2020-10-29 18:19:36:

I was on the relational side (I worked for MySQL) when MongoDB first sprung up, and IMHO it was to some extent a knee-jerk reaction.

MongoDB claimed far superior performance for web-applications simply by not persisting data to disk. MongoDB offered one thing that was very useful to web-developers: a good API for storing JSON data directly into the database. Not having transactional semantics was of course a terrible idea once critical data was stored into the databases and the systems started to have crashes, but it worked for small applications and while developing the applications nothing looked amiss.

Since web-developers are not database experts and many did not have experience with deploying large-scale applications, it was of course very appealing to have both performance and simplicity. But as the downsides of not persisting data started to become a real problem, many applications moved away from MongoDB and into other systems that offered transactional capabilities and durability of data. At the same time, MongoDB started to implement transactional support, which of course affected performance, but allowed applications to remain on MongoDB.

All in all, I think the takeaway is that database APIs matter to developers (not really very surprising) and it is important to have a good API to the database, but transactional behavior is important for real applications and it is essential to have that as well.

With JSON being added to PostgreSQL and MySQL, I think that some of that simplicity that MongoDB offered is available in PostgreSQL and MySQL as well. (The API is still an SQL API, but at least you can easily store and retrieve JSON documents directly from the database.)

I think the main problem during that time is that the issues of performance and API was conflated and a general assumption grew that NoSQL also meant high performance, which is not necessarily the case.

eloff wrote at 2020-10-29 17:46:52:

I think NoSQL works for a very limited number of cases. Most of the time you just end up moving all the schema checking and integrity checking (foreign key enforcement) to the application level and then you're squarely worse off.

Plus MongoDB, which is the poster child of the revolution, has been a technical pile of shit for a decade. It's slower than JSONB on Postgres, and loses data in the default configuration, was insecure in the default configuration, failed Aphyr's Jepsen distributed system consistency tests every time - in other words promised things about consistency that were lies in their documentation.

They have improved a lot, but I feel so sorry for people who jumped on that bandwagon early.

My interest lies in databases and database internals, and I spent 8 years building databases in my spare time. I always advise everyone that they should use a relational database unless they are very, very sure their requirements don't require it and won't ever require it. It's much easier to start with Postgres and switch to NoSQL later if you really need to go that way than to do it the other way around. Most times you find you don't _need_ to switch to NoSQL. The imagined hordes of users never materialized to overwhelm your trusty relational database.

mfreed wrote at 2020-10-29 17:40:10:

One of interesting things we’ve found is that the typical data models we’ve seen in TimescaleDB (and time-series workloads in general) is that, while having JOINs and additional tables with metadata, business info, etc. is super useful -- and missing from most other time-series databases, leading to frustrating polyglot solutions that deploy them alongside Postgres/MongoDB/whatever -- it’s much less common to see the terribly deep snowflake schema patterns that came to dominate OLAP, with levels-upon-levels of normalization.

And certainly the fact that your time-series hypertable in TimescaleDB can also store JSON, given Postgres’ underlying capabilities, also helps here and thus provides developers with flexibility given their application/performance needs.

vamega wrote at 2020-10-29 17:35:28:

Can you expand on what you mean by an acyclical hierarchal relationship?

manigandham wrote at 2020-10-30 00:19:08:

What spectrum? The actual database types are relational, key-value/wide-column, document-store, graph, etc.

NoSQL is a meaningless buzzword but one good outcome was more attention to non-relational systems and where they can be a better fit, as well as creating more progress for those systems as well like better usability and APIs to match the relational dbs.

merb wrote at 2020-10-29 17:38:52:

actually I think the typical application should mix "NoSQL" paradigmas and relational.

consider a Offer model, do I really need OfferPosition's as an own table or can I include them via something like "jsonb".

jayd16 wrote at 2020-10-29 17:41:16:

It's a weasel statement that only claims to "a lot" of user sentiment. It's not something you can easily disagree with because the claim is so vague.

tehlike wrote at 2020-10-29 17:42:55:

Postgres is hell of a no-sql database :)

binakot wrote at 2020-10-30 06:10:05:

This is the awesome news! Waiting for release and start to preparing for migration from single-node setup to distributed one. If someone just want to try distributed release candidate with simple setup (1 access node + 2 data nodes) I have the demo on my github:

https://github.com/binakot/Multi-Node-TimescaleDB

. You are welcome :)

hardwaresofton wrote at 2020-10-29 18:10:31:

TimescaleDB is purpose-built for time-series workloads, so that you can get orders of magnitude better performance at a fraction of the cost, along with a much better developer experience. This means massive scale (100s billions of rows and millions of inserts per second on a single server), 94%+ native compression, 10-100x faster queries than PostgreSQL, InfluxDB, Cassandra, and MongoDB – all while maintaining the reliability, ease-of-use, SQL interface, and overall goodness of PostgreSQL.

_WITH_ the links to back up their analysis/comparisons. I think I was first sold the hardest on Timescale when I read the InfluxDB post.

Timescale also releasing their hard work over the last ~3 years with a very permissive, surprisingly business friendly license[0] which was previously discussed on HN[1].

This is pretty huge -- personally one of the things I'm really looking forward to is using Postgres as a backing store for Prometheus now, which timescale actually worked on[2] already. If you look really closely, there's actually a way to get _all_ your observability data into Postgres (i.e. zipkin/jaeger, logs, and prometheus for metrics), and Timescale is going to make _all_ of those things easier to scale and maintain. For those following along at home, postgres already has some not terrible full text search[3] and ways to integrate with elasticsearch like zombodb[4]...

[EDIT] - Just to make what I'm hinting at a little clearer, I think you might be able to build a Graylog[5] type system _really_ easily with _just_ Postgres these days. Imagine deploying a single binary that only needs a single database to do _everything_ it needs to do.

[0]:

https://blog.timescale.com/blog/building-open-source-busines...

[1]:

https://news.ycombinator.com/item?id=24579905

[2]:

https://github.com/timescale/promscale

[3]:

https://www.postgresql.org/docs/current/textsearch.html

[4]:

https://www.zombodb.com/

[5]:

https://github.com/Graylog2/graylog2-server/

robertlagrant wrote at 2020-10-29 19:37:50:

That is pretty awesome. I'd love to see a postgres-aware cache plugin that keeps the latest versions of data in memory, SAP HANA-style.

robertlagrant wrote at 2020-10-29 17:34:32:

I'll nominate myself as silly question asker.

Has this been used as the persistence for an event sourcing service? Does that sound like a good/terrible idea?

Does it play nicely with SQLAlchemy?

mfreed wrote at 2020-10-29 18:17:54:

Yes, works with SQLAlchemy. (Like many ORMs there are a couple things to be aware of, but plenty StackOverflow questions about these already :)

And yes, people use it for event sourcing. Often see this in various product analytics use cases.

ryanbooz wrote at 2020-10-29 18:24:14:

Although I can't point to a specific application that's currently using TimescaleDB for an event sourcing service (I assume you're talking about the design pattern), it would be a great use case. Change in state over time is perfect time-series scenario.

As for SQLAlchemy, TimescaleDB is fully PostgreSQL compliant, so it will work out of the box. There will be a few nuances that need to be considered, particularly when writing queries that use some of our additional time-series functions like time_bucket.

But day-to-day ORM CRUD patterns should work as you expect!

robertlagrant wrote at 2020-10-29 19:27:58:

Thank you (both of you!). Agreed - I mean the design pattern, but I was being a bit vague.

johnthescott wrote at 2020-10-29 20:53:27:

are pg extensions supported?

-j

mfreed wrote at 2020-10-29 21:15:21:

Yes, TimescaleDB plays friendly with most PG extensions.

And Timescale Cloud comes pre-loaded with many of them -

https://kb.timescale.cloud/en/articles/2754991-supported-pos...

alireza94 wrote at 2020-10-29 20:12:57:

Yes. We’ve used Timescale with SQLAlchemy and overall everything works.

robertlagrant wrote at 2020-10-29 16:55:25:

Really, really interesting. I struggled to find prices for the cloud editions?

hnmullany wrote at 2020-10-29 17:02:33:

https://www.timescale.com/cloud-pricing

lights0123 wrote at 2020-10-29 17:12:46:

there's also

https://www.timescale.com/forge#cloud-pricing

robertlagrant wrote at 2020-10-29 17:26:58:

Thanks both! Could not find that!

johnthescott wrote at 2020-10-29 20:52:12:

getting a javascript error in brave browser on mac. says option not defined.

mfreed wrote at 2020-10-29 21:32:45:

Sorry about that! Can you email me any screenshots / info at mike (at) timescale? Thanks!

mkurz wrote at 2020-10-29 18:58:46:

@Timescale devs,

what are the differences between Timescale and Citus? When do I want to use which? AFAIK Citus scales multi-nodes long time already, your solution of hypertables reminds me of Citus distributed tables... Citus can do rebalancing long time as well... And much more.

Would love to hear what you think makes you different.

ryanbooz wrote at 2020-10-29 20:11:35:

TimescaleDB is purpose built for time-series data on Postgres. We offer native compression on this data and smart, easily adjustable partitioning, continuous and real-time analytics, automated time-series data management (retention, reordering, tiering, etc) and so forth... again, all focused on time-series data. With a few specific commands, we abstract all of the time-series partitioning for you and provide time-series specific functions to make queries more efficient and useful.

Citus is a great product that primarily focused on multi-tenant SaaS analytics, where you typically shard your queries on the tenant_id, but didn't focus on time-series. So, their current documented solution for time-series uses other extensions (pg_partman), doesn't provide for compression, and all the other essential tools to make massive amounts of time-series data useful, which TimescaleDB is hyper-focused on solving.

didip wrote at 2020-10-29 20:34:33:

Hi Timescale folks,

Can you guys publish a more detailed architecture design on how you scale the multi node? I am particularly curious how you make the Access Nodes multi-master and HA.

mfreed wrote at 2020-10-29 21:13:31:

Our docs cover some of the data node management [0] and native replication [1] particular in more detail.

Right now you can use Postgres physical replication for the access node, with one as the primary (and handling DDL/writes), while Access Node replicas can scale reads to the rest of the cluster. As alluded to in the blog post, we'll probably look to enable clients to write directly to data nodes as a form of further scaling writes (so that the AN "primary" would only be strictly required for DDL), as opposed to a multi-master approach.

[0]

https://docs.timescale.com/v2.0/using-timescaledb/distribute...

[1]

https://docs.timescale.com/v2.0/using-timescaledb/distribute...

kev009 wrote at 2020-10-29 19:05:42:

With my FreeBSD ports maintainer hat on, I don't see a 2.0 tag in github, was this blog premature?

RobAtticus wrote at 2020-10-29 19:19:08:

The latest tag is 2.0.0-rc2 since it is technically a release candidate and not the final cut.