Understanding dbt. Seeds and Sources

Mykola-Bohdan Vynnytskyi
9 min readJan 12, 2025

Photo by Joanna Kosinska on Unsplash

Welcome back to the Understanding dbt series! In this article, we’ll continue exploring the core components of dbt and dive into two powerful features: seeds and sources.

Seeds and sources are essential in making dbt projects more versatile and easier to manage. Whether you’re working with small, static datasets or referencing raw data tables in your warehouse, these features simplify your workflow and improve collaboration across your team.

If you’re joining this series, I highly recommend checking out the previous articles to get a solid foundation:

Today, we’ll break down what seeds and sources are, when to use each, and how to set them up in your dbt project. By the end of this article, you’ll understand how to leverage these features to make your data pipelines more efficient and your projects more maintainable.

Let’s get started!

What are Seeds in dbt?

In dbt, seeds are static CSV files that you can load into your data warehouse as tables. Think of them as lightweight, easy-to-manage datasets that can be version-controlled and included in your dbt project just like your models, tests, and documentation.

But what makes seeds so useful? Well, they allow you to quickly create reference tables, mappings, or any other small, static datasets that your transformations might need. And the best part? You don’t need an external ETL process to load them — dbt takes care of that for you!

Why Use Seeds?

Seeds are perfect for:

  • Lookup Tables: Imagine you have a set of mapping rules or categorizations. Instead of hardcoding them into your SQL logic, you can store them in a CSV file and load them as a table.
  • Static Configuration Data: For example, a table with country codes, business regions, or other metadata that rarely changes.
  • Testing Datasets: If you’re experimenting with dbt or building a new pipeline, seeds can serve as mock data for testing transformations.

How to Use Seeds in dbt

Working with seeds in dbt is super simple and doesn’t require any advanced setup.

Example Seed File

Here’s an example of a simple seed file, regions.csv:

| region_id | region_name   |
|-----------|---------------|
| 1 | North America |
| 2 | Europe |
| 3 | Asia-Pacific |

When you run dbt seed, this CSV will be loaded into your warehouse as a table called regions.

And here’s how you can add and use seeds in your project:

  1. Create a CSV File: Add your static data in a CSV format and place the file in the data/ folder of your dbt project.
    For example:
data/
└── regions.csv

2. Load the Seed: Run the dbt seed command to load the CSV file into your data warehouse as a table:

dbt seed

3. Reference the Table: Once the seed is loaded, you can use it in your models like any other table:

SELECT *
FROM {{ ref('regions') }}

That’s it! The seed is now part of your dbt project and can be used throughout your transformations.

Real-World Use Cases

Let’s look at some practical scenarios where seeds can save the day:

  • Mapping Logic: Use a seed to map product categories or customer segments.
  • Static Dimensions: Create a table with holiday calendars or business-specific metrics that rarely change.
  • Testing and Prototyping: Quickly spin up a table with mock data to test your transformations without relying on external sources.

Seeds are a simple but powerful feature in dbt that makes your projects more flexible and easier to manage.

What are Sources in dbt?

In dbt, sources refer to raw tables in your data warehouse that serves as the starting point for your data transformations. They act as the link between your raw, upstream data and your transformed models. By defining sources, you make it clear which tables are external to dbt and allow for better documentation, testing, and data pipeline management.

Why Use Sources?

Sources help in several ways:

  • Clear Boundaries: They separate raw data (your source tables) from the transformed data created by dbt.
  • Documentation: With sources, you can easily document your upstream datasets, including their purpose and relationships.
  • Testing: Sources allow you to add tests directly to raw tables, helping you identify data issues at the source before they affect downstream models.
  • Traceability: They make your data lineage visible, so you always know where your transformed data originated.

How to Define Sources in dbt

Defining a source in dbt is simple. It requires adding the source configuration to your schema.yml file.

Here’s an example:

version: 2

sources:
- name: raw_data_source
description: "Source tables for raw data"
tables:
- name: transactions
description: "Raw transactions table"
- name: customers
description: "Raw customers table"
  • Source Block: The sources block defines the name of the raw data source and its tables.
  • Description: Adding descriptions for the source and tables improves documentation and helps team members understand the purpose of the data.

Referencing a Source in dbt

Once a source is defined, you can use the source() function in your models to reference the raw tables:

SELECT *
FROM {{ source('raw_data_source', 'transactions') }}

This tells dbt to pull data from the transactions table in the raw_data_source.

Example Use Case

Suppose you have a transactions table in your data warehouse that holds raw sales data. You define this table as a source in your schema.yml.
Then, in your dbt model, you use the source() function to reference it for transformation:

WITH raw_transactions AS (
SELECT *
FROM {{ source('raw_data_source', 'transactions') }}
)

SELECT
transaction_id,
customer_id,
transaction_date,
total_amount
FROM raw_transactions
WHERE total_amount > 0

This creates a clean, traceable path from the raw data source to your transformed model.

Real-World Benefits of Sources

Sources are especially helpful for:

  • Data Lineage: By documenting your raw data as sources, you make it easy to trace how data flows through your transformations.
  • Collaboration: Clear documentation of sources improves team collaboration, as everyone knows where the data comes from and how it’s being used.
  • Data Quality: Adding tests to sources helps catch issues in raw data before they propagate to downstream models.

Seeds vs. Sources: When to Use Which?

Now that we’ve covered what seeds and sources are, let’s break down the differences between them and explore when to use each in your dbt project.
While both are powerful tools for managing data, they serve very different purposes.

What’s the Difference?

Here’s a quick side-by-side comparison of seeds and sources:

| Aspect           | Seeds                                 | Sources                             |
|------------------|---------------------------------------|-------------------------------------|
| Definition | Static CSV files managed in dbt. | Raw tables in the data warehouse. |
| Purpose | Used for small, static datasets. | Connects to raw, upstream data. |
| Control | Fully managed within the dbt project. | Created and maintained outside dbt. |
| Use Cases | Lookup tables, mappings, testing. | Raw data for transformations. |

When to Use Seeds

Seeds are best used when you have small, static datasets that are part of your dbt project. Here are some typical use cases:

  • Lookup Tables: Store mappings, like region names or product categories.
  • Static Configuration: Create tables for fixed metadata, like business rules or operational thresholds.
  • Testing Data: Quickly generate mock data for prototyping or testing transformations.

Example: A seed file containing a country-to-region mapping:

| country_code | region_name   |
|--------------|---------------|
| US | North America |
| DE | Europe |
| JP | Asia-Pacific |

Run dbt seed to load this data into your warehouse, and reference it as a table in your models.

When to Use Sources

Sources are perfect for connecting to raw, external data that you don’t manage within dbt but need for your transformations. Common scenarios include:

  • Raw Data Tables: Reference raw transactions, customer data, or logs in your warehouse.
  • Upstream Data Pipelines: Document and test data ingested by other tools or systems.
  • Data Lineage: Define clear dependencies between raw data and transformed models.

Example: A source for a raw transactions table in your warehouse:

sources:
- name: raw_data_source
tables:
- name: transactions

Use {{ source(‘raw_data_source’, ‘transactions’) }} to reference this table in your models.

Complementary Use

Seeds and sources aren’t mutually exclusive — they work well together! For example:

  • Use a source to pull raw transaction data from your warehouse.
  • Use a seed to load a small table of business rules that maps regions to countries.
  • Combine them in a model to create clean, transformed data.
WITH raw_transactions AS (
SELECT *
FROM {{ source('raw_data_source', 'transactions') }}
),
region_mappings AS (
SELECT *
FROM {{ ref('region_seed') }}
)

SELECT
t.transaction_id,
t.country_code,
r.region_name,
t.total_amount
FROM raw_transactions t
LEFT JOIN region_mappings r
ON t.country_code = r.country_code

Testing and Documenting Seeds and Sources

Once you’ve set up seeds and sources in your dbt project, it’s important to add tests and documentation.

This ensures data quality and improves collaboration by making your dbt project easy to understand for your team and future contributors.

1. Testing Seeds and Sources

dbt makes it simple to add tests to your seeds and sources using schema.yml.
These tests help catch issues like duplicate entries, missing values, or invalid data in your datasets before they cause downstream problems.

Example: Adding Tests for Seeds

Here’s how you can add basic tests to a seed file, such as ensuring uniqueness or checking for null values:

version: 2

seeds:
- name: region_mapping
description: "Static mapping table for regions and country codes"
columns:
- name: country_code
description: "ISO country code"
tests:
- not_null
- unique
- name: region_name
description: "Name of the region"
tests:
- not_null
  • not_null: Ensures that no rows in the column have NULL values.
  • unique: Ensures that each value in the column is unique.

To run tests for your seed, use:

dbt test - select region_mapping

Example: Adding Tests for Sources

For sources, you can add tests in a similar way to validate the raw data. For example:

version: 2

sources:
- name: raw_data_source
description: "Raw data source for transactions and customers"
tables:
- name: transactions
description: "Raw transactions table"
columns:
- name: transaction_id
description: "Unique ID for the transaction"
tests:
- not_null
- unique
- name: total_amount
description: "Total amount for the transaction"
tests:
- not_null
- accepted_values:
values: [0, 1, 10, 100, 1000]

This ensures that:

  • The transaction_id column contains no nulls and is unique.
  • The total_amount column has only valid values (optional test for specific cases).

To run tests for your source, use:

dbt test --select source:raw_data_source

2. Documenting Seeds and Sources

Good documentation is critical for making your dbt project maintainable and understandable. Adding descriptions to your seeds and sources provides context about their purpose and how they’re used.

Example: Documenting a Seed

In your schema.yml, you can add descriptions for seeds and their columns:

seeds:
- name: region_mapping
description: "Static mapping table for regions and country codes"
columns:
- name: country_code
description: "ISO country code"
- name: region_name
description: "Name of the region"

Example: Documenting a Source

Similarly, document your sources to explain the role of the raw data:

sources:
- name: raw_data_source
description: "Raw data source for transactions and customers"
tables:
- name: transactions
description: "Raw transactions table"
columns:
- name: transaction_id
description: "Unique identifier for the transaction"
- name: total_amount
description: "The total monetary amount for the transaction"

Auto-Generated Documentation

When you run dbt docs generate, dbt uses the descriptions in your schema.yml to create a documentation site for your project. You can view this site by running:

dbt docs serve

This interactive site allows you to:

  • Explore the structure of your dbt project.
  • View descriptions of seeds, sources, models, and tests.
  • Trace data lineage through the Directed Acyclic Graph (DAG).

Conclusion

I hope this article has given you a clear understanding of seeds and sources in dbt.
These two features play a crucial role in making your data pipelines more efficient, maintainable, and collaborative. Whether you’re working with small, static datasets (seeds) or connecting to raw, upstream tables (sources), dbt provides the tools to manage them seamlessly.

By defining seeds and sources, adding tests, and documenting them properly, you can ensure data quality, traceability, and scalability in your projects. It’s all about making your data workflows simpler and more robust.

If you enjoyed this article, don’t forget to check out the rest of the Understanding dbt series:

Thank you for reading!

If you have any questions, comments, or ideas for future topics, feel free to reach out.

Don’t forget to clap and share if you found this helpful. See you in the next article as we continue exploring dbt and the world of data!

Sign up to discover human stories that deepen your understanding of the world.

Mykola-Bohdan Vynnytskyi
Mykola-Bohdan Vynnytskyi

Written by Mykola-Bohdan Vynnytskyi

Data Engineer by day. YouTuber, author, and creator of courses by night. Passionate about Big Data and self-development.

No responses yet

Write a response