32 upvotes, 6 direct replies (showing 6)
View submission: Diamond Hands on the Data 💎🙌📈
[deleted]
Comment by Seaoftroublez at 05/02/2021 at 23:31 UTC
50 upvotes, 0 direct replies
There are a few ways to scale databases.
The simplest way to scale is to throw more resources at the underlying computer that hosts the db. If you're using AWS RDS, this means increasing the vCPU count, memory, and volume size.
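For illustration, a minimal boto3 sketch of that kind of vertical resize (the instance identifier, instance class, and storage size are made-up values, not anything from this thread):

```python
import boto3

rds = boto3.client("rds")

# Scale the box up: a bigger instance class buys more vCPUs and memory,
# AllocatedStorage grows the volume (in GiB).
rds.modify_db_instance(
    DBInstanceIdentifier="my-db",     # hypothetical instance name
    DBInstanceClass="db.r5.2xlarge",  # hypothetical target size
    AllocatedStorage=500,
    ApplyImmediately=True,            # don't wait for the maintenance window
)
```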
With SQL-like databases where there is a guarantee of ACID, the bottleneck for scaling is the ability to "write" to the database, since writing requires locking mechanisms.
In this case, you can typically split up the single writer from multiple readers. When someone writes to the writer database, the data is automatically replicated to the readers. There's usually a bit of replication lag, typically a few milliseconds.
Readers are extremely easy to scale. You just add another "computer". When someone tries to read from your collective database, they'll pick one of the readers to read from. There is tooling that does this for you automatically. Like web load balancing, but for databases.
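As a rough sketch of that single-writer/multi-reader split in application code (hostnames are hypothetical; in practice a proxy or your cloud provider's reader endpoint usually does this routing for you):

```python
import random
import psycopg2  # assuming a Postgres-style setup

WRITER = "db-primary.example.com"
READERS = ["db-replica-1.example.com", "db-replica-2.example.com"]

def connect(for_write: bool):
    # Writes always hit the primary; reads pick a random replica.
    host = WRITER if for_write else random.choice(READERS)
    return psycopg2.connect(host=host, dbname="app", user="app")

with connect(for_write=True) as conn, conn.cursor() as cur:
    cur.execute("INSERT INTO comments (body) VALUES (%s)", ("hello",))

with connect(for_write=False) as conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM comments")
    print(cur.fetchone())
```

Keep the replication lag mentioned above in mind: a read issued immediately after a write may not see it yet.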
Scaling up "writers" is more difficult. One approach is called sharding. Essentially it's partitioning data across distinct databases. If you generate a unique identifier for each comment based on some property (maybe user ID), even numbers could go to database A while odd numbers go to database B. In reality it gets a bit more complicated, but that's the gist.
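A toy version of that even/odd routing (real systems usually hash the key across many shards, often with consistent hashing, but the idea is the same):

```python
SHARDS = {
    0: "database-A",  # even user IDs
    1: "database-B",  # odd user IDs
}

def shard_for(user_id: int) -> str:
    # Derive the shard from the key itself, so any node can route a
    # request without a central lookup.
    return SHARDS[user_id % 2]

print(shard_for(42))  # database-A
print(shard_for(7))   # database-B
```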
Before scaling up the databases more, you may want to improve the cache layer. A typical cache is a key-value store, so random access is much faster than in a database. Solutions like Redis have built-in sharding for this. Caches are a lot easier to scale than a database, especially when you don't care about referential integrity. The downside is that random-access storage (RAM) is more expensive than sequential storage (disk).
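A minimal cache-aside sketch with redis-py: check the cache first, fall back to the database on a miss, then populate the cache with a TTL. The key scheme and `load_comment_from_db` are hypothetical placeholders.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def load_comment_from_db(comment_id):
    # placeholder for a real query against the primary/replicas
    return {"id": comment_id, "body": "hello"}

def get_comment(comment_id):
    key = f"comment:{comment_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no database round trip
    comment = load_comment_from_db(comment_id)
    r.setex(key, 300, json.dumps(comment))  # cache for 5 minutes
    return comment
```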
Comment by misledyouth56 at 06/02/2021 at 03:00 UTC
13 upvotes, 0 direct replies
To add to this, from Keyser's comment it sounds like they are using Cassandra, which is a NoSQL database that forgoes some of the consistency guarantees you get with an ACID-compliant db (like MySQL or Postgres) in favor of the ability to scale and to have your data replicated across multiple nodes in a cluster, where nodes can be added and removed at will. A subset of data gets replicated to a subset of nodes in the cluster, making it more resilient to outages of a single node.
When the DB needs to scale, you can add another node to the cluster and the DB does the work of rebalancing your data and writing to the new node you added. The trade-off is that writes are not instantly consistent across all nodes; they reach a consistent state eventually (hence "eventual consistency").
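You can see that trade-off directly in Cassandra's tunable consistency levels. A small sketch with the DataStax Python driver (hosts, keyspace, and table are hypothetical):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["node1.example.com", "node2.example.com"]).connect("app")

# ONE: acknowledged by a single replica. Fast, but other nodes may
# briefly serve stale data until replication catches up.
fast_write = SimpleStatement(
    "INSERT INTO comments (id, body) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(fast_write, (123, "hello"))

# QUORUM: waits for a majority of replicas. Slower, but much more
# likely to read your own writes.
safe_read = SimpleStatement(
    "SELECT body FROM comments WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(safe_read, (123,)).one()
```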
Comment by flippedalid at 05/02/2021 at 23:18 UTC
8 upvotes, 0 direct replies
There are two types of scaling: vertical and horizontal. Vertical is one machine getting more power; horizontal is splitting load between multiple machines. There are many tools and pieces of software in place to handle horizontal syncing. Reddit is definitely not just one server. They sync data across MANY countries and regions, so their infrastructure has to be well thought out and synced accordingly. If you want to learn more about it, you can look up "horizontal scaling". There are a few articles out there about how Facebook, Google, and some others handle their data; Reddit would function in a similar manner.
Comment by secrav at 05/02/2021 at 23:15 UTC
3 upvotes, 0 direct replies
There are actually a lot of servers holding the same data, so that if one fails you have a backup, and you can balance the load between those servers. Scaling up is just adding servers that copy that data to get ready, and there you go.
Not sure it's well explained.
Comment by Gijahr at 05/02/2021 at 23:25 UTC
2 upvotes, 0 direct replies
Check out horizontal scaling; you can indeed run a database over multiple machines.
Comment by redtexture at 07/02/2021 at 06:02 UTC
1 upvotes, 0 direct replies
Concepts:
Master and slave data machines (a single primary that takes writes, with replicas that copy it and serve reads)