Understanding dbt

Introduction to the T of the modern data stack

Mykola-Bohdan Vynnytskyi
4 min read · Feb 4, 2024

When I first started learning dbt, every article began by saying that dbt is the “T” in the ETL/ELT process.

On the one hand, everything seems clear: it is just a tool that transforms our data into production-ready datasets. On the other hand, it cannot be just that, and indeed this technology provides many more benefits.

dbt allows you to integrate software development best practices into your data project — code versioning, testing, documentation, and more — easily and stress-free.

One article is not enough to explain all the beauty of dbt.
So today we’re going to go over the basics of this tool and the concepts around it.

ETL/ELT

To understand the beauty of this technology, we need to take a step back and look at the bigger picture of the world of big data (frankly speaking, it is very funny to use the term “big data” in 2024 :D).

Historically, there was the concept of ETL.
It is an abbreviation of the three words Extract, Transform, and Load.
The system Extracts data from some source, such as a folder in HDFS or an API, Transforms it so that it matches the target schema, and Loads it into, for example, a Data Warehouse.

This concept still works in some projects, but it limits us: we can only analyze the data we have transformed and, unfortunately, cannot gain new insights from the original data.

“It would be great to be able to store the source data immediately in storage, and then transform it as needed” — thought smart people and created a new concept from the same words, but in a different order — ELT.

So now the system Extracts data from some source, Loads it in its raw form to the storage, and after that Transforms it as needed.

The beauty of this technique is that we always have access to the original data: the Data Science team can run their ML models on it, while the Data Analytics team receives transformed data saved in a data mart.

In fact, this topic can be expanded into a separate article, since it generally affects the evolution of data storage.
But in our case, a general understanding of these processes is sufficient.

Problems that dbt solves

Back to the main topic of our article.

As mentioned earlier, dbt is responsible for data transformation, but in addition it solves several problems faced by data projects:

Modularity

dbt allows you to break data transformation logic into modular, composable pieces called models. This makes the transformation code easier to understand, maintain, and collaborate on.

Models are written in simple, readable SQL: you specify the source table, write the transformation logic, and leave everything else to dbt.

-- dim_listings_cleansed.sql

WITH src_listings AS (
    SELECT * FROM {{ ref('src_listings') }}
)

SELECT
    listing_id,
    listing_name,
    room_type,
    -- Treat a minimum of 0 nights as 1 night
    CASE
        WHEN minimum_nights = 0 THEN 1
        ELSE minimum_nights
    END AS minimum_nights,
    host_id,
    -- Strip the dollar sign and cast to a numeric price (Snowflake syntax)
    REPLACE(price_str, '$')::NUMBER(10, 2) AS price,
    created_at,
    updated_at
FROM src_listings

The example above cleans the data from the src_listings table and creates a new dim_listings_cleansed object.

To create a new object, it is enough to name the file correctly and write the logic using a combination of SQL and Jinja.

So simple!
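By default, dbt materializes a model as a view; a model can override this with a config block at the top of its file. A minimal sketch (the model shown is a trivial pass-through used only for illustration):

```sql
-- dim_listings_cleansed.sql
-- Materialize this model as a physical table instead of the default view
{{ config(materialized='table') }}

SELECT * FROM {{ ref('src_listings') }}
```

The same setting can also be applied to whole folders of models at once in dbt_project.yml, which keeps individual model files free of repeated configuration.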

Documentation

Documenting a project is something everyone knows needs to be done and something everyone doesn’t want to do.

# schema.yaml

version: 2

models:
  - name: dim_listings_cleansed
    description: Cleansed table which contains Airbnb listings.
    columns:
      - name: listing_id
        description: Primary key for the listing
      ...

dbt helps with this and makes life easier: it is enough to describe the objects in a YAML file, and dbt will generate all the documentation for you!
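For longer descriptions, dbt also supports reusable doc blocks written in Markdown and referenced from the YAML file. A sketch, where the block name listing_id_doc is my own invention:

```markdown
{% docs listing_id_doc %}
Unique identifier for an Airbnb listing, assigned when the listing is created.
{% enddocs %}
```

In schema.yaml the column can then reference it with description: '{{ doc("listing_id_doc") }}', so the same text is reused everywhere that column appears.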

Testing

Testing is another important aspect of our work. Frankly, I like to cover all my code with tests, but not everyone does.

# schema.yaml

version: 2

models:
  - name: dim_listings_cleansed
    description: Cleansed table which contains Airbnb listings.
    columns:
      - name: listing_id
        description: Primary key for the listing
        tests:
          - unique
          - not_null
      ...

dbt comes to our rescue in this situation as well!
All that is required is to describe in the YAML file which values a given field accepts.
Out of the box, dbt ships generic tests such as unique, not_null, accepted_values, and relationships, and extension packages like dbt_utils provide even broader coverage.
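When the built-ins are not enough, you can define your own generic test as a SQL file in the tests/generic folder of the project. A minimal sketch — the test name positive_value is my own assumption, not a built-in:

```sql
-- tests/generic/positive_value.sql
-- A custom generic test: the test fails if this query returns any rows,
-- i.e. if the column contains a non-positive value
{% test positive_value(model, column_name) %}

SELECT *
FROM {{ model }}
WHERE {{ column_name }} <= 0

{% endtest %}
```

Once defined, it is used exactly like the built-ins: list positive_value under a column’s tests in schema.yaml, and dbt test will run it.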

Version Control

With dbt, data transformation code lives in version-controlled files, typically stored in a Git repository.
This enables teams to track changes, collaborate effectively, and roll back to previous versions if needed.

Conclusion

I hope this article has given you an understanding of what dbt is and what it is capable of.

You can see an example dbt project in my Git repository, and if you liked this article, don’t forget to clap!

Frankly, we’re only scratching the surface with this article: dbt can help with a lot more, such as implementing Slowly Changing Dimensions, custom functions, and hooks.

I’ll cover all of this in future articles, so don’t forget to subscribe to learn more about dbt and other data world technologies.

See you!


Written by Mykola-Bohdan Vynnytskyi

Data Engineer by day. YouTuber, author, and creator of courses by night. Passionate about Big Data and self-development.
