How to make any immutable data structure distributed

Author: pierremenard

Score: 114

Comments: 26

Date: 2021-12-05 19:50:19

________________________________________________________________________________

runT1ME wrote at 2021-12-05 22:22:00:

I really think this is where pure FP shines.

If you look at the architecture of something like Apache Beam, while you describe your computations in a language like Java or Python, you're really using a DSL for creating an immutable DAG that processes chunks of immutable data, persisiting each chunk (being loose with terms here) for durability, 'modifying' it in the immutable sense and then passing it on to the next edge and so on.

In a single standalone system, many argue that a purely immutable conceptual approach has pros and cons. In the Big Data world, I've never heard anyone argue for an imperative approach. Immutability is necessary, and as soon as you need immutability you want all the other goodies that facilitate it in a clean way like monads, lenses, and the like.

piyh wrote at 2021-12-05 22:52:59:

FP?

sethammons wrote at 2021-12-05 22:55:13:

Functional programming

Scarbutt wrote at 2021-12-05 23:21:34:

Besides performance, what other cons exists for immutable data structures in a single standalone system?

runT1ME wrote at 2021-12-06 00:59:05:

It's awkward to work with fully nested structures. Think about having a map of customer objects which have a list of addresses and you need to update everyone's first address to be capitalized for some raeson. You'd really want a fully fleshed out lens library and maybe even macros for auto generating them on your concrete data structures to make it easy to 'update' (via a persistent structure) without having to deal with all the cruft of doing it by hand.

melony wrote at 2021-12-06 03:04:56:

Memory usage, the fancier data structures (most of which are pioneered by Okasaki) in newer FP frameworks and languages are often very memory hungry.

Guvante wrote at 2021-12-05 23:45:20:

Sharing immutable data is hard. Mutable throw you throw a lock on it and mutate in place. How do you share mutations when you can't change anything?

To be clear it is certainly a solvable problem (as this post shows) but it can make things difficult.

nyanpasu64 wrote at 2021-12-06 00:19:13:

Sharing mutable data without a mutex (which suffers from unbounded contention) is hard. Approaches that work include updating persistent data structures and sharing the new copy, or sharing diff objects over a lock-free queue.

romero-jk wrote at 2021-12-06 00:11:57:

This is how clojure does it:

https://clojure.org/reference/atoms

It's very costly though.

dwohnitmok wrote at 2021-12-06 02:45:32:

Is it particularly costly compared to locks?

Under the hood it's a effectively a single CAS instruction that loops on failure (which only occurs under contention, but then you have waiting with locks too).

jmfldn wrote at 2021-12-05 20:29:47:

I really hope Unison succeeds. It looks like a really interesting take on FP / distributed systems. The fact that Paul Chiusano and Runar Bjarnason are at the helm is exciting, especially to Scala / FP devs like me who know them as the writers of the legendary "red book". Just based on the rare quality of that book, I'm super interested in other FP projects that they're involved in.

LandR wrote at 2021-12-05 21:53:26:

> writers of the legendary "red book".

If you mean the manning book "Functional Programming in Scala", I don't even use Scala, yet I feel I got so much out of that book. It's the best functional programming book I've read.

It would recommend the book to anyone that wants to learn FP, regardless if you will be using Scala.

cipheredStones wrote at 2021-12-05 22:50:55:

Tangential to the thread, but do you have any other favorite Scala resources you would recommend? I've had only glancing exposure to it, and will soon be taking a role where it's the primary language.

pchiusano wrote at 2021-12-05 21:56:45:

One of the authors of the article here, happy to answer any questions about it!

lidatong wrote at 2021-12-05 23:26:35:

I had a few questions:

1. The posts mention Spark, but were you (the Unison team) also inspired by dataflow / stream processing engines (e.g. Flink, Beam, Timely Dataflow etc.) or any other technologies?

2. Relatedly, are there any plans to add a stream processing API? Or perhaps that would belong in a 3rd-party library?

3. Broadly, where do you think (in your opinion) Unison fits in the ecosystem of tooling? Is Unison specialized at solving big data problems that require distributed execution e.g. something I pull out of my toolbox for a problem I'd otherwise solve with Spark? Or is it a truly general-purpose programming language?

4. With the Tree example, you show the "wrong" way and then the "right" way -- one problem I've experienced in the past is how do you as a new user know the right way. For example, I wrap my entire program in IO then .unsafeRunSync it -- technically following the rules, but got nothing out of doing that. Have you thought about how the APIs could guide the user to the idiomatic solution? Perhaps via types?

By the way, I really loved the FP in Scala book, it was my first step into FP -- thank you and Runar for writing it

rlmark wrote at 2021-12-06 00:35:34:

Hi there! I can't speak to questions 1-3, but with regards to question 4, some of the examples we generated were born from early mistakes I had stumbled into when writing the `Tree` functions, so I have some thoughts. It would be interesting to codify the behavior of evaluating the `Tree` in more types, but looking back, the thing that would have helped me most in avoiding some of the missteps might have been a richer test interpreter, where I could have "mocked" a network of, say, 3 nodes and distributed the `Tree` across them, emitting information about where the data was being moved, and where the various functions were being applied. We had actually briefly thought about writing such an interpreter for the purpose of this article and it's still something I'd like to do at some point.

pchiusano wrote at 2021-12-06 01:10:28:

So what we were going for with the core API is that it should be expressive enough for all kinds of distributed programming tasks, not just Spark-like or batch computing stuff.

We’re planning on doing several of these articles to show different applications. I think some distributed stream processing library would be super cool and there are lots of sources of inspiration for that like you mention. It would be a more interesting translation than the Spark-like stuff though. Likely using the channels ability for message passing.

> By the way, I really loved the FP in Scala book, it was my first step into FP -- thank you and Runar for writing it

That’s great to hear!

Touche wrote at 2021-12-05 23:18:33:

How do you perform side effects in Unison (like reading a file as an example)?

rlmark wrote at 2021-12-06 00:10:12:

Hi there, the other article author chiming in here! You can perform side effects in Unison via abilities (our name for algebraic effects) - they're described here:

https://www.unisonweb.org/docs/abilities/

Some basic IO functionality is supported by the `base` library (the standard lib) which contains functions like `openFile` which are effectful functions expressed in terms of the `IO` ability.

lidatong wrote at 2021-12-05 23:24:00:

I think there were a lot of great ideas presented in these small examples.

1. Explicitly marking non-local _data_ vs. in-memory. A very cool idea -- languages and libraries I've used all seem to try to make this transparent (unsuccessfully, as indicated by the random serialization errors you get)

2. Making the noise around serialization and RPC transparent to the user

3. Bring the code to the data, not the data to the code

4. Immutability, FP, and I like the syntax -- no coincidence in its similarities to Haskell and Scala, I'm sure :)

These are all pain points in a lot of existing "big data systems" I've used, where they either solve the problem half-baked or don't address at all. I'm excited to see where this project goes!

zomglings wrote at 2021-12-06 01:00:07:

Your use of the word transparent confused me a little. It seems you are using it in the sense that details that are being made transparent are being hidden from the user.

But if you make something transparent to the user, it could also mean that you are directly exposing that thing to the user. (Right? I could be wrong about this.)

Just confused me and thought it might be useful to you to know that. Not trying to be an asshole.

pchiusano wrote at 2021-12-05 22:13:02:

Recommend starting with part 1 of the article if you want more of an overview:

https://www.unison-lang.org/articles/distributed-datasets/

jltsiren wrote at 2021-12-06 00:06:23:

I dislike the term "immutable data structure", because it has two widely used conflicting meanings:

1. Read-only data structures, which are used by everyone in every language all the time.

2. Persistent data structures, which make the underlying immutable structures kind of mutable by creating new versions and reusing parts from the old versions of the structure.

mopierotti wrote at 2021-12-06 00:27:53:

Aren't those the same things? i.e. 2. are just read-only data structures with functions to create new read-only data structures based on them. (e.g. immutable linked lists that you can filter, concat, etc. with)

Or are you talking about something more specific, like mutable data structures that provide immutable snapshots of the backing data?

skyde wrote at 2021-12-06 00:29:41:

lol the word "persistent" is itself ambiguous an have conflicting meaning:

https://en.wikipedia.org/wiki/Persistence_(computer_science)

In computer science, persistence refers to the characteristic of state of a system that outlives (persists more than) the process that created it.

https://en.wikipedia.org/wiki/Persistent_data_structure

In computing, a persistent data structure or not ephemeral data structure is a data structure that always preserves the previous version of itself when it is modified.

armchairhacker wrote at 2021-12-06 00:43:15:

I usually distinguish "readonly" values (references which block mutation, but you can mutate via another reference) and "immutable" values (what OP calls persistent, where modification may copy-on-write but you're certain the original reference won't mutate).