💾 Archived View for dioskouroi.xyz › thread › 29424445 captured on 2021-12-03 at 14:04:38. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

What Is Dbt and Why Are Companies Using It?

Author: mooreds

Score: 89

Comments: 67

Date: 2021-12-02 23:28:30


________________________________________________________________________________

beckingz wrote at 2021-12-03 03:12:22:

Data Build Tool makes it easier to get data analysts involved, which helps bring more analytics code into version control. This is a really really good thing given that most data analysts are not software engineers and rightfully prioritize putting their analytical skills to use in the most efficient way possible, which tends to result in analysis getting lost in a data swamp.

jerye wrote at 2021-12-03 03:14:38:

I made a little animated slide deck (

https://dbt.picturatechnica.com/

) a while back about dbt for data-analyst-type folks who want an intro. You can slowly step through each slide to get a feel for what dbt is going to build in your data warehouse.

P.s. I currently work for dbt Labs.

benjaminwootton wrote at 2021-12-03 07:02:51:

Not to steal the OP's thunder, but I have also created a few resources on DBT, if they help.

A demo of integrating DBT and Snowflake

https://timeflow.systems/connecting-dbt-to-snowflake

How DBT DevOps enables data teams:

https://timeflow.academy/blog/how-dbt-devops-enables-data-te...

I am also writing a tutorial on DBT, though there are a few rough edges still:

https://timeflow.academy/dbt-introduction

atentaten wrote at 2021-12-03 15:09:16:

>A demo of integrating DBT and Snowflake

https://timeflow.systems/connecting-dbt-to-snowflake

This link returns a 404

pqyzwbq wrote at 2021-12-03 01:59:24:

I use dbt to build a small Dremio data warehouse. dbt applies software engineering ideas to building data, which is quite awesome. I can use GitHub to version control it, and I can do some unit testing (mainly data quality checks).
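
For context, a dbt data quality check can be as small as a single SQL file that selects rows violating an assertion; a minimal sketch, with a hypothetical model and column:

    -- tests/assert_no_negative_amounts.sql
    -- dbt runs this query; any returned rows count as a test failure
    select *
    from {{ ref('stg_orders') }}
    where amount < 0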

Another thing is its macro system. Macros are not a great programming language, but they make integrating a new DBMS much easier, because most third-party plugins are mainly written in macros. So macros are good too when they're provided by other people (and don't need to be written by myself :) )
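
As a minimal sketch of what such a macro looks like (the names here are made up, not from any particular plugin):

    {% macro date_trunc_day(column_name) %}
      {# Emit adapter-specific SQL based on the configured target #}
      {% if target.type == 'postgres' %}
        date_trunc('day', {{ column_name }})
      {% else %}
        cast({{ column_name }} as date)
      {% endif %}
    {% endmacro %}

    -- In a model:
    select {{ date_trunc_day('created_at') }} as created_date
    from {{ ref('raw_events') }}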

ramraj07 wrote at 2021-12-03 03:36:29:

The issue with dbt is that folks now end up needing to spin up the entire repository just to run a bunch of SQL queries. And because it's a tool targeted at teams that don't have rigorous engineering backgrounds, it's bound to be used in suboptimal ways. If the dbt team can do something about that, it could be even better.

adeelk93 wrote at 2021-12-03 05:12:21:

Before dbt, I didn't even really think about structuring how I'd be running my queries. My experience with it is that it's made clear to me what was sub-optimally laid out: organize my fact tables by subject, adopt a common schema before applying business logic, use re-usable views instead of copy-pasting CTEs, etc.

And I like its tight git integration. The alternative, quite often, is a folder of scripts on an analyst's desktop.

CactusOnFire wrote at 2021-12-03 05:14:32:

Yeah "forcing version control" seems like a feature (pertaining to data governance), not a bug.

bearfood wrote at 2021-12-03 05:16:57:

Shameless, but very relevant plug:

My team has been working to solve this problem (and more). We recently released our CLI tool, palm, as OSS, along with a plugin, palm-dbt, which we designed specifically for working with dbt projects in an optimal way.

Palm requires running code in containers, so teams can ensure dependencies and versions are correct. Then the plugin provides sensible defaults, with solutions for common problems (like idempotent dev/ci runs) and an abstraction in front of the most common commands you need to run in the container.

It's been a huge boon for our analysts' productivity, has improved the reliability and quality of our models, and has decreased onboarding times for new people joining our data team.

Links:

https://github.com/palmetto/palm-cli

https://github.com/palmetto/palm-dbt

xyzzy_plugh wrote at 2021-12-03 01:04:14:

I can't argue that dbt isn't great -- it is. It is, however, unfortunate that Python is still the dominant lingua franca here. Between all the Python data tools and JSON being shuffled around a la singer.io, I just can't help but think there is a huge opportunity awaiting the engineers who build a company around an open source set of products using rigorous languages and schemas.

It's a goddamn madhouse.

oroul wrote at 2021-12-03 02:19:39:

Does it matter that dbt is written in Python? dbt models are still SQL at heart. Sure, there's the addition of Jinja to enable references for data lineage and configuration, but it's all compiled to SQL with fine-grained control over what's produced.
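
As an illustration, a model is just a SELECT plus Jinja refs, and it compiles down to plain SQL; a minimal sketch with made-up model names:

    -- models/orders_enriched.sql
    select
        o.order_id,
        c.region
    from {{ ref('stg_orders') }} o
    join {{ ref('stg_customers') }} c
      on c.customer_id = o.customer_id

    -- At compile time dbt resolves each ref to a concrete relation,
    -- e.g. analytics.stg_orders, and records the edge for lineage.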

Forgive me if I come across as combative, but I don't understand generic appeals to things like a language being _rigorous_. Rigorous to what end? What problem is it solving where that is required? If you know something specific in this domain where that level of rigor is needed, why not share what it is?

There are a _lot_ of problems in the analytics space (and a lot of opportunity for new practices, tools, and businesses), but I would argue that at the end of the day the primary issue is whether or not data producers choose to model data such that it is legible outside of the system that produced it much more than it is about any particular language or methodology.

beckingz wrote at 2021-12-03 04:05:38:

For typical datasets (~95% of companies have medium data on the order of gigabytes), you are 100% correct that data modeling / formatting is the biggest challenge.

Having well modeled data that matches the business domain is a massive (2-10x) productivity boost for most business analysis.

theplague42 wrote at 2021-12-03 01:09:41:

Well, part of the benefit is rapid development; it's mind-boggling how quickly someone can stand up a dbt project and begin to iterate on transforms. Using Python/SQL/JSON (at small/medium scales) keeps the data stack consistent and lowers the barrier to entry. No reason to prematurely optimize when your bottleneck is the modeling and not the actual data volume.

tomnipotent wrote at 2021-12-03 12:00:39:

> using rigorous languages and schemas

And what value does it add?

A vast majority of companies are working with < 1TB of data that sits neatly in a single cloud database. Python and tools like dbt are fantastic for a huge class of problems without compromising workflow velocity, and pushing transformations into SQL removes most Python-bound performance constraints.

Changing singer.io to require Thrift or protobuf schemas isn't going to add the value you think it is. How data is shuffled between systems is considerably less important and time consuming than figuring out how to put that data to work.

glogla wrote at 2021-12-03 17:37:30:

Singer and dbt are different - dbt orchestrates stuff (evaluates macros and runs SQL), while Singer actually has the data flow through it. So rewriting Singer in something fast (I guess Java, since even the most obscure database has a JDBC driver) would definitely help.

It would only help once you start shuttling around terabytes.

killingtime74 wrote at 2021-12-03 03:03:50:

There are only languages people hate and ones no one uses

pphysch wrote at 2021-12-03 03:38:41:

and golang

jgraettinger1 wrote at 2021-12-03 02:35:10:

Our own approach is to keep singer.io et al and JSON (flexible! streaming capable!), but deeply invest in JSON Schema to a) statically infer and map into other kinds of schema on your behalf -- TypeScript, Elasticsearch, and SQL DDL so far -- and b) optimize validation so it's fast enough to be in the high-scale critical path. If you validate every document before it's read or written by your hacky transform, you've gone a long way towards limiting the bug blast radius.

lbotos wrote at 2021-12-03 02:47:14:

I think this is what meltano is trying to help with?

https://www.meltano.com/

theptip wrote at 2021-12-03 03:26:00:

Not really - meltano uses Singer (i.e. it extracts/loads data in JSON form) and dbt (for transformation, in the ELT pattern).

It’s a good tool (I use it), but the concerns GP is raising are very much its weaknesses.

smoyer wrote at 2021-12-03 01:19:58:

On the other hand, it's convenient to use the same language as Jupyter Notebooks.

amznbyebyebye wrote at 2021-12-03 01:09:53:

Can you give a before/after of what the desired state would look like?

jeanlaf wrote at 2021-12-03 03:22:53:

Might be worth looking into Airbyte!

thejosh wrote at 2021-12-03 03:48:59:

Airbyte looks great, and the UI is fantastic.

It uses Java for its connectors, but it has issues importing a massive dataset into S3, as there is a chunk limit of 10k and each chunk is 5 MB :).

jgraettinger1 wrote at 2021-12-03 04:44:19:

We've used

https://github.com/estuary/connectors/pkgs/container/source-...

to load data sets in the many-terabyte range. Caveat: while it's implemented to Airbyte's spec, we've only used it with Flow.

jeanlaf wrote at 2021-12-03 04:55:40:

That's the current focus of the team: consolidating those connectors :)

lmm wrote at 2021-12-03 01:28:44:

Something like Databricks?

orzig wrote at 2021-12-03 00:34:02:

I am a user, and what sets them apart, even beyond the product, is the amazing community they have built that celebrates the unsexy aspects of data science. Just watch some videos from their annual conference; people are so excited to talk about rationalizing business logic.

41b696ef1113 wrote at 2021-12-03 02:41:01:

Evidently, they have a free conference next week[0].

[0]

https://coalesce.getdbt.com

naruvimama wrote at 2021-12-03 09:35:38:

I have been trying to migrate our legacy ETLs, which are mostly SQL strings peppered around in Python code, to DBT.

I like the macros, the templates, the visualisation tool... and would like to make our code as DBT-compatible as possible.

However, the DB we currently use (Vertica) does not have official support, and there are dependency problems installing all of DBT. All I would like for my first milestone is to use refs and macros and better organise the SQL. It is good enough if it generates standard SQL that I can run outside DBT.

My wish:

I wish I could install select DBT packages for just what I need (templates, macros, refs, dependencies) and still make sure I can gradually achieve DBT compatibility. At the moment it looks like all or nothing.

Tarq0n wrote at 2021-12-03 13:08:57:

If your transformations are already in Python scripts, wouldn't you be better off with Airflow or Luigi? dbt seems to be aimed at SQL users primarily.

naruvimama wrote at 2021-12-03 14:51:58:

They are essentially SQL transformations at the reporting stage. Python is just the glue code; a lot of attention has gone into the connectors but not so much into the complexity within the SQL.

The problem I would like to address is complex SQL written as strings.

Some parts of these repeat across multiple reports: column transformations, lookups, joins with dimensions...

scomp wrote at 2021-12-03 00:18:44:

I recently applied for a job on a data warehousing team that was using this. They were using it to replace their transformation stage, moving off legacy tools like SSIS, DataStage, and Informatica. It was certainly impressive and looked great to work with. I can see dbt becoming a lot more common going forward.

andygrd wrote at 2021-12-03 06:27:57:

Having used it for a little over a year now, I can say that its strengths may lie in getting junior developers' code more free of bugs and dependency issues. It's just additional overhead when you're trying to do anything more complex, like a Kimball type 2 slowly changing dimension; then it's just a blocker. Unfortunately, as it becomes a de facto build environment, its limitations start getting applied to everyone.

antruok wrote at 2021-12-03 07:10:52:

The snapshots feature should handle slowly changing dimensions. But are you saying it's not flexible enough for certain edge cases?

andygrd wrote at 2021-12-03 08:19:38:

Yeah, dbt snapshots do a row hash and update if anything in the row changes. We had a source table that had a bunch of daily-changing timestamps, e.g. "load date", that we needed to ignore in favor of a business key. dbt was an utter torment to try to get this going. We ended up building a more elegant framework without it.

antruok wrote at 2021-12-03 10:01:39:

So apparently the check_cols parameter didn't help either?

I'm trying to learn about the critical pain points of dbt, and this case seems interesting.
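
For reference, this is roughly what a check-strategy snapshot looks like (a minimal sketch with hypothetical table and column names; whether it covers cases like the one above is exactly the question):

    {% snapshot orders_snapshot %}
    {# Hash only the listed columns; volatile ones like load_date are left out #}
    {{
        config(
          target_schema='snapshots',
          unique_key='order_id',
          strategy='check',
          check_cols=['status', 'amount']
        )
    }}
    select * from {{ source('erp', 'orders') }}
    {% endsnapshot %}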

theplague42 wrote at 2021-12-03 01:05:47:

dbt and ELT in general are such a game-changer for allowing rapid iteration on business logic by data analysts; the feedback loop feels much more like "normal" software engineering compared to legacy systems.

pkaye wrote at 2021-12-03 01:06:11:

How long before Amazon incorporates this as one of their services?

jorpal wrote at 2021-12-03 01:43:14:

I’m sure they’re working on it. Databricks, however, also has Delta Live Tables coming out soon I believe.

bxji wrote at 2021-12-03 05:05:30:

I'm still a little confused about what dbt does after reading the article. Is it something like Trino, generating materialized views as output but with built-in version control and other features?

dikei wrote at 2021-12-03 07:01:41:

In your case, dbt would be the tool to manage your Trino views programmatically:

* You define the SQL queries that select the input data as models, plus dbt scripts (also SQL) to combine/manipulate those models.

* On running, dbt generates the Trino SQL queries to join/transform these models into the final result (also a model).

* Depending on your configuration, any model can be stored as a Trino view or materialized as a data table that's updated every time you re-run the dbt scripts (see the sketch below).
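
A minimal sketch of that per-model choice (the model and columns are hypothetical; the config() call is standard dbt):

    -- models/daily_revenue.sql
    {{ config(materialized='table') }}  -- or materialized='view'

    select
        order_date,
        sum(amount) as revenue
    from {{ ref('stg_orders') }}
    group by order_date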

adeelk93 wrote at 2021-12-03 05:35:12:

I'd list the DAG and testing as core features as well, but yeah, basically. And it does all this in a very transparent and non-magical way (I despise magical data solutions). If that summary doesn't resonate with you, it's probably not the tool for you. No need to force it.

I'm not familiar with Trino at all, but that sounds like a specific database. dbt is not a database; it is tooling for working with databases.

te_chris wrote at 2021-12-03 10:28:34:

I'm the CTO of a 15 person startup and dbt (along with Mode, Stitch and BigQuery) has been a superpower allowing me to moonlight as a CDO (or some shitty pastiche of one) and push forward our data usage and competence in ways I just didn't think possible. Modern data stack is a marvel in what it unlocks.

dikei wrote at 2021-12-03 02:53:02:

DBT is really nice, but not a game changer since it's basically a SQL generator and only as good as the underlying SQL engine of your data warehouse.

lunatuna wrote at 2021-12-03 03:41:33:

We've recently introduced dbt, Fivetran, and Snowflake, and it has been amazing on many levels. Speed of development along with CI/CD has been fantastic. Clarity of thought about organizing our work has helped the more junior and platform-specific people.

Before I got into it I was sceptical, because I was thinking it's just SQL, and SQL is pretty easy and straightforward. But it breaks bad habits in DW development.

I'm not sure I still get why, but it has been one of the few things I've seen in a long time in this area that has been a big jump in capabilities. There is so much more confidence on my team. We are moving much faster. We can just get on with delivering, without having to undo things or treat everything as precious or dangerous to change.

It took some hands-on work and getting over the initial mind-bend to get there, but I wouldn't go back to what we had. I would likely only make my next move somewhere that had a similar setup in place or was open to change.

thejosh wrote at 2021-12-03 03:47:18:

Exactly what we've been doing for 2+ years at my job.

It's amazing.

Snowflake is amazing, but watch out for search optimization costs (it's great for append-only) and for left joins taking FOREVER (avoid left joins as much as possible for large datasets).

theplague42 wrote at 2021-12-03 04:41:30:

What makes left joins perform poorly in Snowflake?

thejosh wrote at 2021-12-03 06:12:15:

It has to join each part of the previous join to the next join, and if you have a lot of joins this can get out of hand.

We have a lot of joins in our final fct_orders built from our intermediate table, and it looks like this:

from foo
left join bar on bar.common_id = foo.common_id
left join baz on baz.common_id = foo.common_id
left join qux on qux.common_id = foo.common_id
left join waldo on waldo.common_id = foo.common_id

So waldo joins onto the result of the qux join, which joins onto the result of the baz join, and so on. I call it a "staircase join", as that's what it looks like in the SF profiler.

ziga wrote at 2021-12-03 03:20:41:

The dbt community frequently runs Snowflake, and its SQL engine is very good.

https://medium.com/@jthandy/how-compatible-are-redshift-and-...

beckingz wrote at 2021-12-03 04:10:54:

The beautiful (and scary) thing about cloud data warehouses is that you can scale them significantly, which means bad SQL mostly becomes a cost issue as opposed to a speed issue.

Also, dbt makes it easier to persist tables instead of views, which can be a massive performance improvement.

jgalt212 wrote at 2021-12-03 05:00:25:

15,000 members of a Slack community.

That's an impressive number of highly paid people. Nice work by dbt's people.

elchief wrote at 2021-12-03 01:50:12:

So, it's like a makefile with some SQL to do ELT?

dang wrote at 2021-12-03 01:54:43:

"_Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something._"

https://news.ycombinator.com/newsguidelines.html

Edit: if you didn't mean it dismissively and want to clarify that, I'd be happy to remove this comment. We unfortunately see quite a few comments along the lines of "so basically, this is just super-simple $thing and therefore this is dumb" and I interpreted yours through that lens.

22c wrote at 2021-12-03 02:19:11:

Is it a dismissal or is it someone asking for affirmation on their understanding of the product? I think it can be read both ways.

swuecho wrote at 2021-12-03 02:35:48:

If it is for affirmation, the answer is yes. I even wrote a similar tool using SCons. However, dbt is far better.

dang wrote at 2021-12-03 02:20:28:

I wavered about that for a little while and then settled on the dismissive interpretation.

If I misread the comment and elchief wants to clarify, I'd be happy to apologize and correct the mistake.

benjaminwootton wrote at 2021-12-03 07:00:13:

I actually really like DBT and advocated for it. It is, however, quite a simple tool: just some wrappers and a workflow around SQL transformations. Maybe its simplicity is its quality, but the OP you replied to isn't as far off the mark as similar comments.

thejosh wrote at 2021-12-03 03:47:41:

Nope, it has lots of other fantastic features, such as a DAG, macros, etc.

elchief wrote at 2021-12-03 05:11:09:

make is the OG DAG. And it has macros.

swuecho wrote at 2021-12-03 05:40:04:

In theory, you could write dbt in make. Actually, whenever I introduce workflow concepts (DAGs) to others, I tell them it is similar to how make works.

nojito wrote at 2021-12-03 09:08:10:

dbt is the quickest way to watch your cloud costs skyrocket.

I really do hope people don’t think this is a way to not have dedicated data engineers on staff and instead “empower” analysts to do things for themselves.

boxysean wrote at 2021-12-03 13:59:40:

[dbt Labs employee here]

Cloud costs increasing could be a sign of more data being utilized productively. In any case, a mixture of data engineers and analysts is a healthy way to scale a dbt project, balancing speed of delivery of analytics requests against cloud costs. We at dbt Labs encourage the "analytics engineering" mindset to bring software engineering best practices into the dbt analytics workflow (git version control, code reviews), so cloud cost considerations should be incorporated into mature dbt development practices.

mijoharas wrote at 2021-12-03 09:27:58:

Could you expand on this point?

thisguy47 wrote at 2021-12-03 00:14:03:

Unfortunate typo: It’s about to be a decade since the Harvard article touting data science as the _sexist_ job in the 21st century.

x3n0ph3n3 wrote at 2021-12-03 00:14:13:

It’s about to be a decade since the Harvard article touting data science as the _sexist_ job in the 21st century.

<facepalm>

awinter-py wrote at 2021-12-03 02:56:42:

psychedelic in the tryptamine family