💾 Archived View for soviet.circumlunar.space › zwatotem › diff › pivot.gmi captured on 2024-06-16 at 12:55:55. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Change the point of view

This is a popular saying: "To change a point of view", "change a perspective". But I never would have imagined that I would be using it to prove a point about technology.

Or maybe it's about cognition, but I don't know. To the point, though.

Excel

I have liked using Excel ever since I learned about it. It amazed me that I could change a value in one cell and trigger a cascade of changes across the entire spreadsheet, all automatically and in the blink of an eye. Most of my schoolmates laughed at Excel (like everything Microsoft, naturally) as a tool for fools and uneducated people, while real professionals use programming languages. I agreed with them, at the beginning at least, but less and less with time.

The real turning point was understanding pivot tables.

Before I understood them, every slightly more complicated task took me significant time: it meant adding multiple new columns, or even multiple new versions of the original table, each derived from the original with complicated formulas. At that point it really would have been easier to just do it in Python (even though I hated, and still hate, its inconsistent API).

But after I learned them I reached a new mode of data analysis. In the time that others would write one function in Python, I could solve several tasks on a dataset just by changing the parameter column, the described variables, the filters, and the aggregation function.
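To make those four knobs concrete, here is a minimal sketch in plain Python. The `pivot` helper, the `SALES` records and their field names are all made up for this illustration; this is not how Excel implements pivot tables, only the idea of "same data, different parameters".

```python
from collections import defaultdict

# Hypothetical sample data: one record per sale.
SALES = [
    {"region": "North", "product": "apples", "year": 2022, "amount": 10},
    {"region": "North", "product": "pears", "year": 2022, "amount": 5},
    {"region": "South", "product": "apples", "year": 2022, "amount": 7},
    {"region": "South", "product": "apples", "year": 2023, "amount": 3},
    {"region": "North", "product": "apples", "year": 2023, "amount": 4},
]


def pivot(records, row, col, value, agg=sum, keep=lambda r: True):
    """Collect `value` per (row, col) cell, then aggregate each cell with `agg`."""
    cells = defaultdict(list)
    for rec in records:
        if keep(rec):  # the filter
            cells[(rec[row], rec[col])].append(rec[value])
    return {key: agg(values) for key, values in cells.items()}


# Same data, three questions; only the parameters change.
by_region_product = pivot(SALES, "region", "product", "amount")
by_product_year = pivot(SALES, "product", "year", "amount", agg=max)
only_2022 = pivot(SALES, "region", "product", "amount",
                  keep=lambda r: r["year"] == 2022)
```

Each call answers a different question without any new columns or copies of the table.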

Back to programming

After I started studying Computer Science, I forgot about spreadsheets for a good while. Now all the buzz was about programming, although notably I went back to Excel during my Probability & Statistics course, frustrated by Pandas and not ready to learn R.

For a time I held that the approach of functional languages made the most sense for data manipulation. Maps, filters, reduces (a.k.a. aggregates), groups and zips were basically the only things you needed to perform any operation on data without your brain exploding.

The main reason for this, now that I think about it, was that there was really only one operating dimension at a time. You always thought in terms of a list of records, no matter how many fields each record had. To look at it from another perspective you grouped by another property, to get an entirely new set of records.

Even if the records were multilevel objects, they were still easier to look at than a data frame, which is inherently only a 2D structure trying to emulate multidimensional data.
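A minimal sketch of that "one dimension at a time" style, using only Python's standard library. The `group_by` helper and the `PEOPLE` records are assumptions invented for the example.

```python
from functools import reduce

# Hypothetical records: several fields each, but always one flat list.
PEOPLE = [
    {"name": "Ada", "city": "Krakow", "team": "core", "age": 31},
    {"name": "Bob", "city": "Gdansk", "team": "core", "age": 25},
    {"name": "Cy", "city": "Krakow", "team": "tools", "age": 40},
]


def group_by(records, key):
    """Return a new list of records: one {key, members} record per group."""
    def step(groups, rec):
        groups.setdefault(rec[key], []).append(rec)
        return groups
    grouped = reduce(step, records, {})
    return [{"key": k, "members": v} for k, v in grouped.items()]


# filter, then map: the result is still a flat list.
adult_names = list(map(lambda p: p["name"],
                       filter(lambda p: p["age"] >= 30, PEOPLE)))

# Changing the perspective means grouping by another property.
by_city = group_by(PEOPLE, "city")
by_team = group_by(PEOPLE, "team")
mean_age_by_city = {
    g["key"]: sum(p["age"] for p in g["members"]) / len(g["members"])
    for g in by_city
}
```

Both `by_city` and `by_team` are again plain lists of records, so the same maps and filters apply to them.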

Finding pivots in the coding itself

The state machine adventure

Recently (i.e. half a year ago; time passes quickly) I was tasked with creating a framework for building state machines in C++. I was very ambitious about it. I wanted it to be as performant as possible, so based on templates and computed at compile time. I also wanted it to be as generic as possible, so that meant:

Full hierarchy tree for the states
Full hierarchy tree for the events
Free choice of level of detail for filling in the transition table.

For example I could have:

Filled in every combination of the most particular states and the most particular events, and defined a transition for each of them.
Defined a master transition for the general event and the general state, and overridden it for some more particular cases.
Defined transitions for two states in the middle of the hierarchy for every event, and for all the other states defined transitions only for the two general kinds of events.
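The lookup such a table implies can be sketched at run time in Python. This is not the templated C++ design described above; the hierarchies, the state and event names, and above all the resolution order (a more specific state beats a more specific event) are assumptions made for the example.

```python
# Hypothetical hierarchies: child -> parent, roots map to None.
STATE_PARENT = {"AnyState": None, "Active": "AnyState", "Idle": "AnyState",
                "Running": "Active", "Paused": "Active"}
EVENT_PARENT = {"AnyEvent": None, "UserEvent": "AnyEvent", "Timeout": "AnyEvent",
                "Click": "UserEvent", "Key": "UserEvent"}

# Transition table filled in at mixed levels of detail.
TRANSITIONS = {
    ("AnyState", "AnyEvent"): "Idle",    # master transition
    ("Active", "UserEvent"): "Running",  # mid-hierarchy rule
    ("Paused", "Click"): "Running",      # most particular overrides
    ("Running", "Click"): "Paused",
}


def ancestors(node, parent):
    """Yield node, then its parent, grandparent, ... up to the root."""
    while node is not None:
        yield node
        node = parent[node]


def next_state(state, event):
    """Return the target of the most specific matching transition.

    Assumption: state specificity is tried before event specificity.
    """
    for s in ancestors(state, STATE_PARENT):
        for e in ancestors(event, EVENT_PARENT):
            if (s, e) in TRANSITIONS:
                return TRANSITIONS[(s, e)]
    raise KeyError((state, event))
```

For instance, `next_state("Paused", "Key")` finds nothing for `Paused` itself and falls back to the `("Active", "UserEvent")` rule.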

Sounds complicated, and that's because it is, at least in text. But in the graphic that I prepared to explain this to my supervisor, it all made perfect sense.

A table explaining how the supposed state machine framework would work.

Ultimately I didn't succeed in defining the project this way. Not because of a flaw in this approach, but because I wanted everything to be templated, and that turned out to be technically impossible.

But setting that aside, even if I had succeeded, it would still be hard to understand code using this framework. Provided it had a sizable enough state and event count (coming from necessary complexity), it would span a large scrolling space that would look like either of these two nested lists:

State1:
-> Event1: (transition definition)
-> Event2: (transition definition)
-> ...
State2:
-> Event2: (transition definition)
-> Event3: (transition definition)
-> ...
...

Event1:
-> State1: (transition definition)
-> State2: (transition definition)
-> ...
Event2:
-> State2: (transition definition)
-> State3: (transition definition)
-> ...
...

Neither way is ideal: you may want a different one depending on your needs, but you must commit to one.
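The commitment comes from the textual layout, not from the data. As a sketch (the flat `TABLE` and the `nested_view` helper are hypothetical), the same transitions can be stored once and rendered in either nesting:

```python
# Hypothetical flat transition table: (state, event) -> target state.
TABLE = {
    ("State1", "Event1"): "State2",
    ("State1", "Event2"): "State1",
    ("State2", "Event2"): "State3",
    ("State2", "Event3"): "State1",
}


def nested_view(table, outer):
    """Render the table as nested text, grouped by 'state' or by 'event'."""
    idx = 0 if outer == "state" else 1
    lines = []
    for head in sorted({key[idx] for key in table}):
        lines.append(f"{head}:")
        for key in sorted(k for k in table if k[idx] == head):
            lines.append(f"-> {key[1 - idx]}: {table[key]}")
    return "\n".join(lines)


state_first = nested_view(TABLE, "state")
event_first = nested_view(TABLE, "event")
```

`state_first` reproduces the first layout above and `event_first` the second, from one underlying table.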

Back to dependencies

I encourage you to refresh the ideas from the previous log at this point:

./dependencies.gmi

In short, it says, among other things, that the difference between functional and OO programming is that OO defines different behaviours with data (functions in classes) and functional defines them with functions (match expressions).

Lately I have been exploring different articles on the internet about non-text programming. (This was sparked by my annoyance with code formatting at work, and how hard it is to find consensus with your coworkers on it.)

I found one of the articles particularly in line with my line of thought:
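A small sketch of that contrast as I read the summary above (the linked log itself is not reproduced here). The shape types are hypothetical, and `isinstance` chains stand in for a match expression.

```python
import math
from dataclasses import dataclass


# OO style: each data type carries its own behaviours.
class CircleOO:
    def __init__(self, r):
        self.r = r

    def area(self):
        return math.pi * self.r ** 2

    def perimeter(self):
        return 2 * math.pi * self.r


class SquareOO:
    def __init__(self, a):
        self.a = a

    def area(self):
        return self.a ** 2

    def perimeter(self):
        return 4 * self.a


# Functional style: plain data; each function matches on the type.
@dataclass(frozen=True)
class Circle:
    r: float


@dataclass(frozen=True)
class Square:
    a: float


def area(shape):
    if isinstance(shape, Circle):
        return math.pi * shape.r ** 2
    if isinstance(shape, Square):
        return shape.a ** 2
    raise TypeError(shape)


def perimeter(shape):
    if isinstance(shape, Circle):
        return 2 * math.pi * shape.r
    if isinstance(shape, Square):
        return 4 * shape.a
    raise TypeError(shape)
```

The four implementations are the same in both halves; only the grouping differs, by type in the first and by function in the second.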

Programming without text (archived version, because the original is down)

It outlines the problem that this article has been describing from the beginning, and gives it a name: the "Expression Problem". After further reading on the subject I realized that, in essence, the solution was there the entire time: pivot tables.

The solution I have in mind

When dealing with the expression problem you always have to "modify existing code" when you add a new case (either a function or a data type). But what does it really mean to modify existing code? It's a matter of philosophy, really. If we consider a class an integral whole, then sure, adding a function to it can be considered modifying it. Similarly, if we add another case to a function we can say: it's the same function, so we did de facto change it. But if we throw this kind of thinking out the window, adding a new case is just that: adding things, not modifying. In the simple case of single-argument functions, pivot tables don't really have anything to pivot: it's just a two-dimensional table, where you can add a row or a column and fill it in to satisfy the compiler.
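A minimal sketch of the single-argument case, with the table as a plain Python dictionary. All names here are hypothetical, and `missing_cells` is only a stand-in for the check a compiler would perform.

```python
import math

# Hypothetical table: (data type, function name) -> implementation.
IMPLS = {
    ("Circle", "area"): lambda r: math.pi * r * r,
    ("Square", "area"): lambda a: a * a,
}
TYPES = ["Circle", "Square"]
FUNCS = ["area"]


def missing_cells(impls, types, funcs):
    """Stand-in for the compiler's check: list every unfilled cell."""
    return [(t, f) for t in types for f in funcs if (t, f) not in impls]


# Adding a column (a new function): no existing cell is edited.
FUNCS.append("perimeter")
gaps_after_new_column = missing_cells(IMPLS, TYPES, FUNCS)
IMPLS[("Circle", "perimeter")] = lambda r: 2 * math.pi * r
IMPLS[("Square", "perimeter")] = lambda a: 4 * a

# Adding a row (a new data type) works the same way.
TYPES.append("Triangle")  # equilateral, side length a
IMPLS[("Triangle", "area")] = lambda a: math.sqrt(3) / 4 * a * a
IMPLS[("Triangle", "perimeter")] = lambda a: 3 * a

gaps_at_end = missing_cells(IMPLS, TYPES, FUNCS)
```

Neither extension touches an existing entry; each only adds new keys, and the check reports which cells still need filling.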

Where pivot tables really shine is the case of multiple arguments. Then you can change how you want to see your program. You can, for example, fix the function in view and spread the first argument across columns and the second argument across rows. You may as well fix the first argument and display the different functions along the other axis, effectively getting the OOP style of grouping: treating the first argument as the *object* upon which the function is called, and all the other arguments as regular arguments.
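The two views described here can be sketched over one flat set of cells. Everything in this example is hypothetical: the `collide` and `merge` functions, the `Ship` and `Rock` types, and the `view` helper; the cell values are just names standing in for implementations.

```python
# Hypothetical cells: (function, first arg type, second arg type) -> implementation name.
CELLS = {
    ("collide", "Ship", "Ship"): "bounce",
    ("collide", "Ship", "Rock"): "explode",
    ("collide", "Rock", "Ship"): "explode",
    ("collide", "Rock", "Rock"): "ignore",
    ("merge", "Ship", "Ship"): "dock",
    ("merge", "Ship", "Rock"): "mine",
}
AXES = ("function", "first", "second")


def view(cells, fixed_axis, fixed_value, rows, cols):
    """Fix one axis and spread the other two as rows and columns."""
    f, r, c = AXES.index(fixed_axis), AXES.index(rows), AXES.index(cols)
    grid = {}
    for key, impl in cells.items():
        if key[f] == fixed_value:
            grid.setdefault(key[r], {})[key[c]] = impl
    return grid


# Function fixed in view: first argument in columns, second in rows.
collide_view = view(CELLS, "function", "collide", rows="second", cols="first")
# First argument fixed: the OOP-like grouping, "the methods of Ship".
ship_view = view(CELLS, "first", "Ship", rows="function", cols="second")
```

Both grids are derived from the same `CELLS`; nothing about the stored logic prefers one view over the other.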

The whole magic must sit under the hood though: the representation of the logic underneath all this should be view-agnostic, and the compiler should be able to deduce the implementation for each cell of a table, based on the predicates that are defined in the row and column headers for that cell.

Dependency problem is at the root of this

I must admit, I lied to you. The notion of "changing existing code" is not fully arbitrary. The deciding factor is what can be separated into a different binary. Under the hood, classes, for example, must be packaged as a whole, in a single compilation unit. Similarly, functions are just an address under the hood, or slightly more abstractly, a span of consecutive instructions. You can't divide them into different units.

The expression problem really comes from the assumption that existing data types and functions come from an external library (compilation unit) that we are architecturally forbidden to modify. In that case the whole idea of pivot tables becomes a lot harder, because the existing table is already in a compiled state, in the form of functions, which as we know are only an abstraction over a span of consecutive instructions. For that reason, compiling such a library must already account for the possibility of external extension, inserting the appropriate constructs into the assembly. (That may, and most probably will, result in some overhead, but I still don't have a full grasp of this.)
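One way to picture "having extension in mind" is a library that routes every call through an open registry. The sketch below does this at run time in Python rather than in compiled assembly, and `register`, `call` and the sample cells are all hypothetical names for the illustration.

```python
# --- "library" side: written once, never edited afterwards ---
_REGISTRY = {}


def register(func_name, type_name):
    """Decorator the library exposes so outside code can fill in new cells."""
    def wrap(impl):
        _REGISTRY[(func_name, type_name)] = impl
        return impl
    return wrap


def call(func_name, type_name, *args):
    # The extra lookup on every call is the price of staying open.
    return _REGISTRY[(func_name, type_name)](*args)


@register("area", "Square")
def _square_area(a):
    return a * a


# --- "client" side: adds a row and a column without touching the above ---
@register("area", "Rect")
def _rect_area(w, h):
    return w * h


@register("perimeter", "Square")
def _square_perimeter(a):
    return 4 * a
```

The indirection in `call` is a small-scale version of the overhead mentioned above: the library cannot jump straight to a fixed address, because the cell may have been filled in later by someone else.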

Conclusion

I think that this is a very interesting idea, and I will try to implement it in my master's thesis if it turns out to be a worthy topic. I will try to make a "language" that will allow defining pivot tables, and then compile them to a non-text object representation. Then I need a compiler that can compile this straight into binary. For this I think I'll need an understanding of multiple dispatch, and how it may work in the asm code underneath.

In the next post I will probably take a closer look at the intricacies of the package split, but for now I think I will leave it at that.