💾 Archived View for dioskouroi.xyz › thread › 29417998 captured on 2021-12-05 at 23:47:19. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Hamilton: A Microframework for Dataframe Generation

Author: gammarator

Score: 56

Comments: 20

Date: 2021-12-02 16:11:42

________________________________________________________________________________

feoren wrote at 2021-12-02 19:52:29:

It always amazes me how readily people assume that their problem is unique and special. This article can be summarized as: standard software engineering problem gets solved with standard software engineering practices. The problem they describe of a giant function that gets touched every time something changes is called a "God Function" or a "God Class" and it's one of the most common problems in software engineering. Their solution is called "Dependency Inversion" and it was designed to solve literally that exact problem in the exact way they're doing it.

That doesn't mean it's easy! Anyone who applies Software Engineering principles to make their code better deserves congratulations. So congratulations, Stitchfix! But you could have found this solution a lot quicker if you hadn't fallen into that age-old trap of thinking your data-science code was _different_, and _excluded_ from standard software engineering. Everyone does this, and I don't know why. They read all these blogs and books on software engineering best practices, but then they don't apply them to anything. You can't apply it to your SQL because it's just _different_. Your database organization is _different_. Your I/O code is _different_. Your API surface is _different_. Your UI is _different_. Your feature engineering is _different_.

No! It's software. Stop thinking that you need a unique solution to your unique snowflake problem. Apply well-known and good software engineering practices to everything you ever do.

They even say in the article "It hasn’t grown convoluted out of ... bad software engineering practices". _Of course it has!_ What do you think software engineering practices are supposed to apply to!?

mtVessel wrote at 2021-12-02 20:09:29:

In my experience, many data engineers and data scientists know nothing about software engineering, unless they've had a previous life as a coder. Most data engineers who claim they know python really only know pandas, and many couldn't implement fizzbuzz without a library.

I'm a data guy, but I started out as a developer, and eventually chose to specialize in data systems. I'm continually amazed to find that this path is the exception.

inadequatespace wrote at 2021-12-06 01:17:43:

The reason that path is an exception is that, generally speaking (in VHCOL areas at least), a developer job pays as well as or better than a data science job.

Also, since DS is so broad skillset-wise, it's basically a catch-all starter career for everyone who doesn't have CS/SWE experience and training specifically, but _does_ have some non-CS STEM training.

inadequatespace wrote at 2021-12-06 01:14:32:

> many couldn't implement fizzbuzz without a library.

case in point: this article teaching them how to do so to pass interviews

https://towardsdatascience.com/how-to-solve-the-fizzbuzz-pro...

qqqwerty wrote at 2021-12-03 03:07:01:

Most software engineers know nothing about software engineering. Hop on over to any thread about hiring and you will see plenty of comments about SE candidates failing fizz buzz.

krawczstef wrote at 2021-12-02 21:38:10:

Author here. Yep, I definitely don't think what we've done is conceptually unique, but the implementation/use is. The "trick" here is that the end user doesn't need to know about these great software engineering best practices; they happen naturally.

Re: "god class" or "function" -- that's not accurate. It's more like a "god script". This script was written by different authors, with different styles, all modifying, replacing, or adjusting the code in it.

Now, yes, could that team have avoided some of the problems had they just thought about it? Kind of, but as the post points out, that would only get you so far. The paradigm we created ticks all the boxes and scales very well with team and code base size.

Anyway, I would love for you to install hamilton (pip install sf-hamilton) and try it, and give a first hand perspective :) We're always after feedback. Cheers!

noobhacker wrote at 2021-12-02 20:31:27:

Could you elaborate on how this Hamilton framework is a case of Dependency Inversion (which in my understanding is about removing dependency on low level class and using dependency on high level class instead)?

What's the low and high level classes in this application?

feoren wrote at 2021-12-02 23:43:09:

Let me preface by saying terms like "Dependency Inversion" are fuzzy and meant to convey general strategies that can be refined for specific cases, so my exact definition of DI here might be subtly different from others', and certainly overlaps with other ideas that also have their own fuzzy names.

In its general form, Dependency Inversion says "don't depend on concrete things; depend on abstract things" with the corollary "don't go get something you need; _ask_ for what you need in general terms and let someone else figure out which concrete things you actually get". Following this to its logical conclusion often ends up with exactly the kind of directed acyclic graph data structures they describe.

So look at their old vs. new example:

Old:

    df['COLUMN_C'] = df['COLUMN_A'] + df['COLUMN_B']

df is one extremely specific data frame. It's saying "take this specific column of this exact data frame and add it to this other specific column of the same exact data frame". The result is a new specific column of exactly one data frame.

New:

    def COLUMN_C(COLUMN_A: pd.Series, COLUMN_B: pd.Series) -> pd.Series:
        return COLUMN_A + COLUMN_B

COLUMN_A and COLUMN_B are _any series_. The result is a strategy for creating new columns from existing ones. This is very general. From a DI perspective, COLUMN_A and COLUMN_B can be thought of as "requests".

A directed acyclic graph represents the result of finding the appropriate (concrete) dependencies for every (abstract) request and linking them together. This is what "dependency injection" frameworks are essentially doing. Often this is done via convention, including looking at the names of variables and functions. Hamilton is doing this. In my opinion, names of variables and functions should _not_ be used by dependency injection frameworks if you can avoid it, because it makes the code brittle in the face of what should be "safe" refactors (including minification and uglification). But it's possible it can't reasonably be avoided in Hamilton's case and may be the right choice.
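The name-based wiring described above can be sketched in a few lines. This is a minimal illustration, not Hamilton's actual implementation: `resolve`, `COLUMN_D`, and the `functions` and `inputs` dictionaries are hypothetical names introduced here, and the sketch does no cycle detection or type checking.

```python
import inspect

import pandas as pd


# Column definitions in the style quoted above: the function name is the
# column it produces, and each parameter name is a column it requests.
def COLUMN_C(COLUMN_A: pd.Series, COLUMN_B: pd.Series) -> pd.Series:
    return COLUMN_A + COLUMN_B


# Hypothetical second column, added only to give the graph some depth.
def COLUMN_D(COLUMN_C: pd.Series) -> pd.Series:
    return COLUMN_C * 2


def resolve(name, functions, inputs, cache=None):
    """Compute column `name`, satisfying each parameter by looking up its name."""
    cache = {} if cache is None else cache
    if name in inputs:  # concrete data supplied by the caller
        return inputs[name]
    if name not in cache:  # compute each node at most once
        func = functions[name]
        params = inspect.signature(func).parameters
        kwargs = {p: resolve(p, functions, inputs, cache) for p in params}
        cache[name] = func(**kwargs)
    return cache[name]


functions = {f.__name__: f for f in (COLUMN_C, COLUMN_D)}
inputs = {
    "COLUMN_A": pd.Series([1, 2, 3]),
    "COLUMN_B": pd.Series([10, 20, 30]),
}
result = resolve("COLUMN_D", functions, inputs)
```

Because a node is computed only when something requests it, asking for `COLUMN_C` alone would never run `COLUMN_D`. The sketch also shows the brittleness mentioned above: renaming a parameter silently changes which column it requests.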

Note that you don't need the _injection_ framework (Hamilton) to benefit from using dependency _inversion_. This is often what people mean when they say using dependency inversion makes your code more testable: you can test it in isolation by just calling the function directly, instead of depending on the injection framework to stitch things together for you. That's well and good, but testability is just a side benefit of the real win of cleaner and better organized code.
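As a rough illustration of testing without any injection framework, the column function can simply be called with hand-built series. The test name and the sample values are made up for this sketch.

```python
import pandas as pd


def COLUMN_C(COLUMN_A: pd.Series, COLUMN_B: pd.Series) -> pd.Series:
    return COLUMN_A + COLUMN_B


def test_column_c_adds_elementwise():
    # No data frame and no framework: just two small series built by hand.
    a = pd.Series([1, 2, 3])
    b = pd.Series([10, 20, 30])
    expected = pd.Series([11, 22, 33])
    pd.testing.assert_series_equal(COLUMN_C(a, b), expected)


test_column_c_adds_elementwise()
```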

If you want an even more general pattern, it's a common and effective strategy to look at the structure of the code you're sick of writing, and seek to encode that structure explicitly into a data structure that you can programmatically manipulate. Less code-as-code and more code-as-data. That used to be called "metaprogramming", but it lost its name because it's so general and ubiquitous now. This huge category of refactors covers things like iterables instead of loops, reactive streams instead of callbacks, dependency injection frameworks, expression trees, various forms of reflection, and more.

akdor1154 wrote at 2021-12-02 19:44:09:

Very interesting. In general I'm defaulting to "ahh not another syntax/grammar to learn!", but I can absolutely appreciate that a code base written with this would be far better than the messes of opaque pandas wrangling it would otherwise involve.

I really appreciate API design that consciously encourages clean code, and on that basis this looks fantastic.

spratzt wrote at 2021-12-02 16:59:12:

Looks great and I shall definitely try it.

I've always thought that Stitchfix were the most interesting ecommerce retail operation. It's a shame that none of their data science roles are available in London.

dash2 wrote at 2021-12-02 17:39:25:

Nice. I've really felt the pain point here. In R I use Drake to run a big pipeline on my laptop. In theory that provides the same "only run the computations you need" DAG, but because outputs are defined as whole data frames, changing or adding a single column often forces recomputation of everything. Column level isolation like this is the smart way to go.

__mharrison__ wrote at 2021-12-02 17:42:10:

I would love to see more real-life examples. My take is that most people don't know about the .assign method [0] or severely underutilize it.

[0]

https://www.metasnake.com/blog/pydata-assign.html

pvitz wrote at 2021-12-02 18:56:02:

I don't understand the advantage of the assign method in the article you have linked to. If I want to alter data in a column of a large dataframe, why should I use the assign method (and copy data) instead of the mentioned loc method and just overwrite the data (without a copy)? I am assuming, of course, that I will not need the original dataframe anymore.
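The behavioral difference behind this question can be seen in a small sketch. The `price` column and its values are invented for illustration, and the example says nothing about memory cost, which depends on the pandas version and the size of the frame.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# .assign returns a new DataFrame; `df` itself is left untouched.
discounted = df.assign(price=df["price"] * 0.5)

# .loc writes into an existing DataFrame in place.
# The copy here exists only so that `df` stays available for comparison.
mutated = df.copy()
mutated.loc[:, "price"] = mutated["price"] * 0.5
```

After this runs, `df` still holds the original prices, while `discounted` and `mutated` hold the halved ones. If the original frame really is never needed again, the in-place write gives the same values.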

lmeyerov wrote at 2021-12-02 21:32:44:

Yep, we regularly use assign & pipe to avoid the 80% case I think the article is about:

    df2 = df.assign(new_col_1=f(df))

or

    def add_new_col_1(df): return df.assign(new_col_1=...)
    df2 = df.pipe(add_new_col_1)
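A runnable variant of the assign and pipe pattern might look like the sketch below. The column names, the `add_total` and `add_is_large` helpers, and the threshold are hypothetical and chosen only to make the example self-contained.

```python
import pandas as pd


def add_total(df: pd.DataFrame) -> pd.DataFrame:
    # Each step takes a frame and returns a new one with one extra column.
    return df.assign(total=df["qty"] * df["unit_price"])


def add_is_large(df: pd.DataFrame, threshold: float = 50) -> pd.DataFrame:
    return df.assign(is_large=df["total"] > threshold)


orders = pd.DataFrame({"qty": [1, 5, 10], "unit_price": [9.0, 12.0, 3.0]})

# .pipe forwards extra keyword arguments to the function it calls.
enriched = orders.pipe(add_total).pipe(add_is_large, threshold=50)
```

The steps read top to bottom like ordinary control flow, and `orders` keeps only its original two columns.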

We do use reified compute DAGs in some places, but by that point we're using dask anyway, and that kind of code ends up more annoying than code that avoids it. Speaking as someone who has done years of FRP/streams/etc., if users can stick with direct control flow or things that look like it (async/await), help them do it :)

RE: Immutability, it's a convention that eliminates some classes of bugs, so it's quite nice when a team does it. Code written by senior / reviewed teams is nice b/c these things add up to either a pleasant experience or perpetual paranoia in the day-to-day:

* A lot of our code is in notebook envs, where the ability to move cells up/down matters, so we try to only do single assignments to avoid non-reproducibility bugs

* Similar in production, but more about when someone comes back and wants to edit or log. Having to undo the name reuse is annoying, and may lead to bugs when you don't notice it
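The notebook point can be illustrated with a small sketch in which each function stands in for a notebook cell that gets run twice. The frame, the column names, and the function names are invented for this example.

```python
import pandas as pd

raw = pd.DataFrame({"x": [1, 2, 3]})

# Name reuse: the "cell" overwrites its own input, so running it a
# second time changes the answer.
reused = raw.copy()


def cell_with_name_reuse():
    reused["x"] = reused["x"] * 2


cell_with_name_reuse()
cell_with_name_reuse()  # simulates re-running the cell


# Single assignment: the "cell" reads `raw` and produces a new frame,
# so running it again yields the same result.
def cell_with_single_assignment():
    return raw.assign(x_doubled=raw["x"] * 2)


doubled = cell_with_single_assignment()
doubled = cell_with_single_assignment()  # simulates re-running the cell
```

Here `reused["x"]` ends up quadrupled after two runs, while `doubled["x_doubled"]` is the same no matter how many times its cell runs.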

wxnx wrote at 2021-12-02 16:46:10:

This is awesome to see!

I had a very similar idea recently (representing "big" preprocessing pipelines in Python as DAGs), but for research-scale projects. Funnily enough, I was also motivated by a time series project that has resulted in several thousand lines of gnarly preprocessing code, and growing - mostly in Pandas.

I'm not sure Hamilton would totally work for our use case, but I'll be following closely either way. Thank you for open sourcing this!

elijahbenizzy wrote at 2021-12-02 17:10:07:

Author here -- that's awesome! Great minds think alike. Always interested in use-cases (that both fit and don't) -- feel free to open up an issue with something representative/what you want and I'm happy to contribute my opinion.

lr1970 wrote at 2021-12-02 18:59:22:

An earlier thread 23 days ago [0]

[0]

https://news.ycombinator.com/item?id=29158021

imcoconut wrote at 2021-12-02 17:19:52:

This looks amazing, and very timely.

Any chance you have a mirror on gitlab (or github)?

zxexz wrote at 2021-12-02 17:24:52:

This[0] was linked in the article.

[0]

https://github.com/stitchfix/hamilton

jcmontx wrote at 2021-12-02 19:45:10:

Thanks I'll stick to Verstappen