Understanding dbt: Models

An introduction to the main building block of dbt

Mykola-Bohdan Vynnytskyi

dbt has many components that facilitate and improve the work of a Data and Analytics Engineer, but none of them makes sense without the main building block: the model.

In this article, we will get to know models: what they are, how to work with them, and how to combine them.

So, without further ado, let’s dive into models!

Models are simply SQL files containing the core transformation logic from which a new database object will be created.

Easy, right?

First steps

As you might have guessed, working with models is quite simple, as is working with dbt in general.

In the beginning, a SQL file is created in the models folder; its name will serve as the name of the object in the database.

That is, if you name the file my_new_cool_table.sql, the name of the created object will be my_new_cool_table.

The models directory is the central folder where models, along with their documentation and tests, are stored.

Inside the file, a select statement against the database is written. Based on this query, a new object will be created in the database.

So if you write:

SELECT * 
FROM my_old_cool_table

then dbt will create an object whose schema matches this result (in this case, since we select all columns, it will have the same schema as my_old_cool_table) and fill it with data according to the result (in this case we simply copy all the data from my_old_cool_table).

After the file is created and populated, we run the following command:

dbt run

which tells dbt to compile the SQL files in our models folder and materialize the corresponding objects in the database.
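
To make this concrete: with the default view materialization, the statement dbt runs behind the scenes is roughly equivalent to the following (the my_schema name is just a placeholder; the exact DDL depends on your warehouse and profile settings):

-- a rough sketch of the DDL dbt generates for a view materialization
CREATE VIEW my_schema.my_new_cool_table AS (
    SELECT *
    FROM my_old_cool_table
);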

Considering the nuances

Based on the previous section, two problems can immediately be noted:

1. Why do we need to make separate files for simple selects?
2. If there is more than one file in my folder, do I have to build every object each time I run?

Let’s try to dive a little deeper into these topics, answering these questions and asking new ones.

As mentioned at the beginning, models are files that store the core transformation logic from which objects will be created.

The key idea here is transformation.

Transformations can be simple, such as taking only a subset of the original table or renaming columns, or more complex, using analytical functions and joins across several tables. The main idea is that the file ends with a select statement expressing these transformations, based on which dbt will create the object.
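
For example, a simple transformation model might just rename a few columns and filter rows. The table and column names below are purely illustrative:

-- models/active_customers.sql (illustrative)
SELECT
    id AS customer_id,
    full_name AS customer_name,
    created_at
FROM raw_customers
WHERE is_active = TRUE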

How should we organize these transformations?

We could simply put everything in a subquery:

SELECT *
FROM (SELECT custom logic here)

And it will work perfectly, but the readability of our code will leave much to be desired.

A good practice in dbt projects is modularity.
We can achieve modularity at the file level using Common Table Expressions (CTEs).

CTEs are a fairly common feature across databases; I’m sure you’ve come across them before.

The main idea is that we define a temporary named result set using the following syntax:

WITH table_name AS (
SELECT custom logic here
)

We can also define several CTEs, separating them with commas:

WITH table_name_1 AS (
SELECT logic
),
table_name_2 AS (
SELECT logic
)

after which we write the main query that defines the object:

SELECT t1.*
FROM table_name_1 t1
LEFT JOIN table_name_2 t2
ON t1.col = t2.col

This way, the main logic is clearer, and if we need to change something in the future, we can easily modify individual CTEs.
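
Putting it together, the whole model file holds both the CTE definitions and the final select as a single statement:

WITH table_name_1 AS (
    SELECT logic
),
table_name_2 AS (
    SELECT logic
)

SELECT t1.*
FROM table_name_1 t1
LEFT JOIN table_name_2 t2
ON t1.col = t2.col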

In general, all these files serve as the foundation of our data project, and anyone involved in it can open the repository and easily understand how each object is created.

Addressing the Initial Questions

Now that we have a good background and the necessary knowledge, let’s answer the main questions:

  1. Why do we need to make separate files for simple SELECT statements?

Using separate files for each model helps maintain a clear and organized structure, especially as the complexity of transformations grows.
Even simple SELECT statements can benefit from being modular and isolated, making it easier to manage, update, and debug each transformation independently.

2. If there are multiple files in my folder, do I have to build every object each time I run?

With dbt, you don’t have to rebuild every object on every run. You can selectively run specific models using the dbt run --select model_name command, where model_name is the name of the particular model you want to run.
This allows you to execute only the desired transformations without processing the entire project, providing flexibility and saving time.
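
A few common selection patterns, using the model from earlier as an example:

dbt run --select my_new_cool_table     # build only this model
dbt run --select +my_new_cool_table    # build this model plus everything it depends on
dbt run --select my_new_cool_table+    # build this model plus everything that depends on it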

Materialization of Objects

Materialization in dbt is all about how the transformed data is stored in your data warehouse. Different types of materializations are available, each suited for specific scenarios. Let’s break down the main types and when to use them.

1. View

  • Description: A view is essentially a stored query that doesn’t materialize the data. Instead, it runs the query each time the view is accessed.
  • Use Case: Use views when you need the most current data without requiring additional storage. This is useful when the underlying data changes frequently.

2. Table

  • Description: A table materializes the data as a physical table in the data warehouse, storing the results of the query.
  • Use Case: Tables are ideal when you need fast read access to relatively static data. They’re perfect for data that doesn’t change often but needs to be accessed regularly.

3. Incremental

  • Description: An incremental model updates only new or changed data rather than rebuilding the entire table. It keeps the existing data and adds or updates only the new records.
  • Use Case: Incremental materialization is great for large datasets where it’s inefficient to rebuild the entire table each time. It’s ideal for data that grows over time, such as logs or transaction records.

4. Ephemeral

  • Description: An ephemeral model is not materialized in the database. Instead, dbt inlines the model’s SQL into dependent models, making it a temporary step in the data transformation process.
  • Use Case: Use ephemeral models for intermediate transformations that don’t need to be stored physically but are necessary for building final models. They’re useful for complex transformations that are part of a larger data pipeline.
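
In practice, the materialization is set either in dbt_project.yml or with a config() block at the top of the model file. Here is a minimal sketch of an incremental model; the table and column names (raw_events, event_id, created_at) are made up for illustration:

-- models/my_events.sql (illustrative)
{{ config(materialized='incremental', unique_key='event_id') }}

SELECT
    event_id,
    event_type,
    created_at
FROM raw_events

{% if is_incremental() %}
-- on incremental runs, only pick up rows newer than what is already in the target table
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}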

Choosing the Right Materialization

Choosing the right materialization can significantly impact the performance and efficiency of your data pipeline. Here’s a quick guide:

  • Views: Best for displaying the most current data without consuming storage space.
  • Tables: Ideal for fast read access to relatively static data that doesn’t change often.
  • Incremental: Perfect for handling large, continuously growing datasets efficiently.
  • Ephemeral: Great for simplifying complex transformations without needing to store intermediate results.

By understanding and selecting the appropriate materialization for each model, you can optimize both the performance and storage of your data transformations in dbt. This allows for better management and scaling of your data projects.

Connections Between Objects

One of the powerful features of dbt is its ability to create and manage dependencies between models, ensuring that data transformations occur in the correct order. This is achieved by using the ref() function, which helps define dependencies between models and builds a Directed Acyclic Graph (DAG) of your data pipeline.

Let's explore how this works and the benefits it brings.

Using ref() to Create Dependencies

In dbt, the ref() function is used to reference other models within your project. This function tells dbt which models a particular model depends on, allowing dbt to determine the correct execution order. By using ref(), you can ensure that your transformations are based on the latest available data from upstream models.

Here’s an example:

-- my_new_cool_table.sql

SELECT *
FROM {{ ref('my_old_cool_table') }}

In this example, the model my_new_cool_table depends on the model my_old_cool_table. By using ref(), dbt knows to run my_old_cool_table first before running my_new_cool_table.
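
At compile time, dbt replaces the ref() call with the fully qualified name of the upstream relation, so the compiled query ends up looking roughly like this (the analytics schema is just a placeholder):

-- compiled output (roughly)
SELECT *
FROM analytics.my_old_cool_table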

Building the DAG

When you define dependencies using ref(), dbt automatically builds a Directed Acyclic Graph (DAG) of your models. This DAG visually represents the order in which models must be executed and how they are connected.

The DAG helps you understand the data flow through your pipeline and makes identifying and managing dependencies easier.

You can visualize the DAG in the dbt documentation site, which is generated automatically by dbt. This visualization provides a clear and interactive way to explore the relationships between your models.
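
The documentation site, including the DAG view, is built with two commands:

dbt docs generate    # compile the project and build the documentation artifacts
dbt docs serve       # serve the docs site locally so you can browse the models and the DAG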

Benefits of Using ref()

  1. Order of Execution: Ensures that models are executed in the correct order, maintaining data integrity and consistency.
  2. Modularity: Encourages modularity by allowing you to break down complex transformations into smaller, manageable pieces that can be reused and maintained independently.
  3. Documentation: The DAG visually represents your data pipeline, making it easier to understand and document the data flow.
  4. Change Management: By explicitly defining dependencies, you can easily track how changes to one model might impact others, reducing the risk of errors.
  5. Scalability: As your data project grows, the DAG helps you manage and scale your transformations by providing a clear overview of the entire pipeline.

Here’s an example of a more complex transformation using multiple references:

-- my_aggregated_table.sql

WITH base_data AS (
    SELECT *
    FROM {{ ref('my_old_cool_table') }}
),

aggregated_data AS (
    SELECT
        category,
        COUNT(*) AS count,
        AVG(value) AS average_value
    FROM base_data
    GROUP BY category
)

SELECT *
FROM aggregated_data

Conclusion

In this article, we delved into the core component of dbt: models.

We discussed how to create models, the importance of modularity, and touched on materialization and the connections between objects. By understanding these concepts, you can better manage and scale your data projects with dbt.

I hope this article has provided you with a solid foundation in dbt models. Stay tuned for the next article.

If you found this article helpful, don’t forget to clap and subscribe to learn more about dbt and other data technologies.

See you next time!
