Created: 2021-02-10T22:36:33-06:00
Action triggers an event.
Event captures details about the action.
Event sourcing: storing changes against a datum instead of applying and discarding the changes in-place.
Change Data Capture: event sourcing as a policy; you write and maintain a list of changes to a datum since some prior snapshot or genesis.
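A minimal sketch of the event-sourcing idea in Python (names and event shapes are illustrative, not from the book): store the change events, and derive the current state by replaying them instead of updating in place.

```python
# Event sourcing sketch: keep the changes, derive state by replay.
events = []  # append-only list of change events for one account (the "datum")

def record(event):
    events.append(event)  # never update in place, only append

def current_balance():
    # The snapshot is derived by replaying every change since genesis.
    balance = 0
    for e in events:
        if e["type"] == "deposit":
            balance += e["amount"]
        elif e["type"] == "withdraw":
            balance -= e["amount"]
    return balance

record({"type": "deposit", "amount": 100})
record({"type": "withdraw", "amount": 30})
print(current_balance())  # 70
```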
Job queue vs event log: a job queue is for events where order matters less and getting each job to any available worker matters most; an event log is for events where ordering is paramount, even over parallelism.
Book really seems to like storing events in logs.
Worst failure mode: doing dual writes; data eventually becomes inconsistent and stays that way forever. The book recommends application servers write to append-only logs like Kafka, and that caches have their own Kafka readers that consume the change log and update their indices in write order.
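A toy sketch of that flow (in-memory stand-ins for Kafka and the cache; all names are illustrative): the application appends to one log and nothing else, and the cache catches up by reading that log in order, so there is never a second, separate write to keep consistent.

```python
# Single append-only change log; the application writes here and nowhere else.
change_log = []

def write(key, value):
    change_log.append({"key": key, "value": value})

# The cache keeps its own read position and applies changes in write order,
# instead of being notified directly by the application (no dual write).
cache = {}
cache_offset = 0

def cache_catch_up():
    global cache_offset
    while cache_offset < len(change_log):
        entry = change_log[cache_offset]
        cache[entry["key"]] = entry["value"]
        cache_offset += 1

write("user:1", {"name": "Ada"})
write("user:1", {"name": "Ada Lovelace"})
cache_catch_up()
print(cache["user:1"])  # {'name': 'Ada Lovelace'}
```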
The truth is the log. The database is a cache of a subset of the log. - Pat Helland
Derivation: computing something based on input data; ex. the most liked comment is derived from having the most +1 events.
Streaming query: an actor that watches events, filters for events of interest, and generates its own events based on hits.
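A small sketch of a streaming query as a filter over an event stream (illustrative names and event shapes): watch the events, keep the hits, emit derived events.

```python
# Streaming query sketch: filter an event stream and emit derived events.
def streaming_query(events, predicate, derive):
    for event in events:
        if predicate(event):      # filter for events of interest
            yield derive(event)   # generate a new event based on the hit

incoming = [
    {"type": "comment", "id": 1},
    {"type": "like", "comment_id": 1},
    {"type": "like", "comment_id": 1},
]

likes = streaming_query(
    incoming,
    predicate=lambda e: e["type"] == "like",
    derive=lambda e: {"type": "like_counted", "comment_id": e["comment_id"]},
)
print(list(likes))
```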
Aggregation: reducing many events to statistics about similar events; reducing a list of changes to a datum to a snapshot of the datum's current state.
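Both of those ideas in one small sketch (illustrative data): aggregating +1 events into per-comment counts, and deriving "most liked comment" from that aggregate.

```python
from collections import Counter

# +1 (like) events, one per action.
plus_one_events = [
    {"comment_id": 1}, {"comment_id": 2}, {"comment_id": 2}, {"comment_id": 2},
]

# Aggregation: reduce many similar events to a statistic per comment.
likes_per_comment = Counter(e["comment_id"] for e in plus_one_events)

# Derivation: "most liked comment" is computed from the aggregated input data.
most_liked = likes_per_comment.most_common(1)[0]
print(likes_per_comment)  # Counter({2: 3, 1: 1})
print(most_liked)         # (2, 3)
```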
Storing data: individual records vs. aggregates. Aggregates are smaller and faster to query but lose fidelity; individual records keep full fidelity but require storing and processing much more volume.
B-Tree: a tree whose pages represent key ranges, used by most database-like things. Sometimes a page has to be split to store more items within a particular range.
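A toy illustration of just the page-split idea, not a real B-tree (no internal nodes, no balancing; page size and data are made up): when the page holding a key range fills up, it splits into two half-full pages.

```python
# Toy page-split sketch: sorted "pages" of keys that split when full.
import bisect

PAGE_SIZE = 4
pages = [[]]  # sorted pages, themselves kept in key order

def insert(key):
    # Find the page whose range should hold this key.
    i = 0
    while i + 1 < len(pages) and pages[i + 1] and key >= pages[i + 1][0]:
        i += 1
    bisect.insort(pages[i], key)
    if len(pages[i]) > PAGE_SIZE:
        # Page overflow: split it into two half-full pages.
        mid = len(pages[i]) // 2
        pages[i:i + 1] = [pages[i][:mid], pages[i][mid:]]

for k in [10, 20, 30, 40, 25, 27]:
    insert(k)
print(pages)  # [[10, 20], [25, 27, 30, 40]]
```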
Write Ahead Log: an append-only log where events are stored and committed to disk prior to updating indices.
Log Structured Storage: when the index is based around sourcing data from the write ahead log, under the assumption you are storing the data in a log anyway.
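A small sketch combining the two notes above (file path and record format are illustrative): every write is appended to the log and fsynced first, and the index just maps keys to offsets within that log, so the log itself is the data.

```python
# Log-structured storage sketch: the write-ahead log is the data,
# and the index is just {key: offset into the log}.
import json, os

LOG_PATH = "data.log"          # illustrative path
open(LOG_PATH, "wb").close()   # start from an empty log for the sketch
index = {}                     # key -> byte offset of the latest record

def put(key, value):
    record = (json.dumps({"key": key, "value": value}) + "\n").encode()
    with open(LOG_PATH, "ab") as f:
        f.seek(0, os.SEEK_END)       # offset where this record will start
        offset = f.tell()
        f.write(record)
        f.flush()
        os.fsync(f.fileno())         # commit to disk before updating the index
    index[key] = offset

def get(key):
    with open(LOG_PATH, "rb") as f:
        f.seek(index[key])
        return json.loads(f.readline())["value"]

put("a", 1)
put("a", 2)
print(get("a"))  # 2; the index points at the latest record in the log
```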
Raft consensus: synchronizing a strictly ordered log across nodes, creating current world state from this log.
Kafka: stores event log and presents it for consumers to read from; individual consumers can read from different parts of it.
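A sketch of the consumer-offset idea, with an in-memory list standing in for the Kafka log (names are illustrative): the log is shared, and each consumer just remembers how far it has read.

```python
# Consumers share one log but track their own read positions.
log = ["e1", "e2", "e3", "e4"]

class Consumer:
    def __init__(self, start_offset=0):
        self.offset = start_offset   # each consumer keeps its own offset

    def poll(self, max_records=10):
        records = log[self.offset:self.offset + max_records]
        self.offset += len(records)
        return records

fresh = Consumer(start_offset=0)   # reads from the beginning of the log
tailer = Consumer(start_offset=3)  # only wants the newest events
print(fresh.poll())   # ['e1', 'e2', 'e3', 'e4']
print(tailer.poll())  # ['e4']
```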
Dual writes: when an application does a thing itself, then separately has to notify the caches to update; if either write fails or they race, the two diverge.
Kafka log partitioning: an entry is routed to one of several logs (partitions) based on a hash of its key, so all entries for the same key land in the same partition, in order.
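A sketch of key-based routing (illustrative; Kafka's own default partitioner hashes differently, a plain CRC stands in here): the key determines the partition, so entries for one key always go to the same log.

```python
import zlib

NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

def send(key, value):
    # Hash the key to pick a partition; the same key always maps to the
    # same partition, so its entries stay in order relative to each other.
    p = zlib.crc32(key.encode()) % NUM_PARTITIONS
    partitions[p].append((key, value))

for i in range(3):
    send("user:1", f"event-{i}")
send("user:2", "event-0")
print(partitions)
```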
Kafka log compaction: when a log is compacted only the latest write to a key within a log is kept, while all prior changes are erased.
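A sketch of compaction over a keyed log (illustrative data): only the latest write per key survives, and the survivors keep their log order.

```python
# Log compaction sketch: keep only the latest entry for each key.
log = [("a", 1), ("b", 1), ("a", 2), ("c", 1), ("a", 3)]

def compact(entries):
    latest = {}
    for offset, (key, value) in enumerate(entries):
        latest[key] = (offset, value)   # later writes overwrite earlier ones
    # Preserve log order among the surviving entries.
    return [(k, v) for k, (off, v) in sorted(latest.items(), key=lambda kv: kv[1][0])]

print(compact(log))  # [('b', 1), ('c', 1), ('a', 3)]
```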
Avro serialization: a format which embeds the writer's schema inside the Avro file, so a reader can compare it to the schema of the program trying to read it and resolve the differences (schema upgrades) as necessary.
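A small sketch of that schema-resolution behavior, assuming the third-party fastavro package (the record and field names are made up for illustration): the file carries the writer's schema, and the reader supplies its own newer schema when reading.

```python
# Schema evolution sketch with Avro (assumes the fastavro package).
import io
import fastavro

writer_schema = fastavro.parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "name", "type": "string"}],
})
# Newer reader schema adds a field with a default, so old files still resolve.
reader_schema = fastavro.parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
fastavro.writer(buf, writer_schema, [{"name": "Ada"}])  # schema embedded in file
buf.seek(0)
for record in fastavro.reader(buf, reader_schema=reader_schema):
    print(record)  # {'name': 'Ada', 'email': None}
```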
Unix philosophy: make programs that do one thing well, and assume its output will be the input to some as yet unwritten program.
Minimizing irreversibility -- Martin Fowler
If a new index is to be made, a worker for the new index is spun up from the genesis of the Kafka log. It then indexes all the data (the Kafka log being the single source of truth) and subsequent changes until it eventually catches up; then the old index can be decommissioned.
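A sketch of that migration (the in-memory log and index worker are illustrative): the new worker starts at offset 0, replays everything, and keeps consuming until it has caught up with the head of the log.

```python
# Rebuild-a-new-index sketch: replay the whole log from genesis, then keep up.
log = [("user:1", "v1"), ("user:2", "v1"), ("user:1", "v2")]  # source of truth

class IndexWorker:
    def __init__(self):
        self.offset = 0   # start at the genesis of the log
        self.index = {}

    def catch_up(self):
        while self.offset < len(log):
            key, value = log[self.offset]
            self.index[key] = value
            self.offset += 1

new_index = IndexWorker()
new_index.catch_up()            # indexes all historical data
log.append(("user:3", "v1"))    # new writes keep arriving meanwhile
new_index.catch_up()            # ...and are consumed until it has caught up
print(new_index.index)          # now the old index can be decommissioned
```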