dbt development at Vimeo

How we leveled up our data transformation workflow.

Evan Calzolaio
9 min read · Feb 4, 2022


Do you work with data? Then you’re likely familiar with dbt, a tool that aims to “let data teams work like software engineers” by enabling engineers and analysts to define data transformations with simple SQL.

This blog post is for anyone who is interested in or sold on dbt Cloud, the SaaS platform offered by dbt Labs that wraps features of dbt with a web interface and provides additional tools such as job scheduling, Slack notifications, and GitHub integration. You may be in the midst of evaluating the product or just researching better ways to work with data at your company. Whichever bucket you fall into, you’ll want to get a sense of how much engineering effort is required to set up dbt with distinct environments and Continuous Integration testing.

In this post, I share how we on the Data Engineering team at Vimeo develop with dbt and how it compares to our previous workflow.

Understanding data at Vimeo

At Vimeo, we ingest data from dozens of data sources into our Snowflake data warehouse. With an abundance of raw data available, data modeling is critical in enabling us to understand the business, track key performance metrics, and make projections. It’s therefore important to understand the lineage, or relationships, of our data models and to version them via source control.

Most of our ETL and ELT pipelines are orchestrated by Apache Airflow, while business logic is generally stored as SQL files in source control and executed with SnowSQL. Incremental loading, slowly changing dimensions, and backfilling are all implemented manually; we have even created Jinja templates for incremental models in the past. Many of our Airflow-based pipelines were only performing transformations (the T in ETL). This is where we wanted to introduce dbt in place of Airflow, since dbt specializes in SQL transformations.

Why we did it

With Airflow orchestration, we found testing our SQL workflows to be a huge challenge. We typically tested Airflow DAGs, or Directed Acyclic Graphs, in a separate staging environment cloned from the production database, but this human-in-the-middle approach made automated Continuous Integration testing almost impossible. Access to data model creation was also stifled by this gated approach to development: analysts might have the SQL skills to create data models, but deploying models to production with Airflow required a separate set of skills. This created a significant bottleneck in model creation and led to workarounds that fell outside the governance of our Data Engineering team.

We saw dbt as a solution that would centralize data model governance under the Data Engineering team, standardize common SQL procedures, and integrate a testing framework for data models. We wanted to empower analysts and engineers to deploy data models that were automatically tested.

To summarize, we had to work through the following challenges:

  • Business logic was written in SQL but orchestrated with Airflow.
  • Airflow wasn’t analyst-friendly and required a separate skill set.
  • Airflow made testing SQL difficult because of our human-in-the-middle approach.

dbt solves these challenges by separating SQL logic from Airflow orchestration, simplifying the development workflow, and opening the door to automated testing.

How we did it

Before setting up our dbt environment and workflow, we outlined a few criteria that had to be met to keep production in a healthy state:

  • Models should be tested and validated at the pull request, or PR, level.
  • Users shouldn’t be able to merge unless their model backfills, increments (if applicable), and passes its tests successfully.
  • To keep the development experience analyst-friendly, we shouldn’t enforce anything that can’t be done via the dbt Cloud UI. An example of this would be pre-commit hooks.

These criteria are going to make a big difference during our staging step (see Figure 1). But first, let’s start with an overview of the development environment.

Figure 1: This figure is an overview of the developer workflow, from local development to tests that take place in the cloud. The figure shows that a developer can iterate locally or in the dbt Cloud IDE and open a pull request with their changes. From there, Jenkins will run a Python script telling dbt Cloud which branch to test.

Development

Development may be done via the Cloud UI or locally; however, we expect most contributors to use the Cloud UI. Additionally, developers and analysts should have access to production data but shouldn’t be able to affect production data or models. To balance access to fresh data with the health of our database, each Snowflake user at Vimeo is granted a personal schema in which only they can create database objects at will. Developers then set their personal schema as the target schema in the dbt Cloud UI, or in their ~/.dbt/profiles.yml file if developing locally.
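For local development, this amounts to pointing the profile's schema at the personal schema. A minimal sketch of such a profile, in which every name is a placeholder rather than our actual configuration:

    # ~/.dbt/profiles.yml (sketch) -- all names below are placeholders
    vimeo:
      target: dev
      outputs:
        dev:
          type: snowflake
          account: vimeo_account
          user: jdoe
          authenticator: externalbrowser
          role: analyst_role
          database: analytics_dev
          warehouse: dev_wh
          schema: jdoe_scratch   # the developer's personal schema
          threads: 4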

But what if we want to build models across multiple schemas? We’ve decided to prevent this in the development environment but enable it for both the staging and production environments. For example, a developer can still say that a model should eventually be materialized in the Vimeo schema, but while they are developing, the model will only be materialized in their personal schema. To accomplish this, we defined a dbt macro that overrides how schema names are resolved per environment.
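A minimal sketch of this kind of override, in which the environment target names ('staging' and 'prod') are assumptions:

    -- macros/generate_schema_name.sql (sketch)
    {% macro generate_schema_name(custom_schema_name, node) -%}
        {%- if target.name in ['staging', 'prod'] and custom_schema_name is not none -%}
            {#- In staging and production, honor the schema configured on the model. -#}
            {{ custom_schema_name | trim }}
        {%- else -%}
            {#- In development, every model lands in the developer's personal schema. -#}
            {{ target.schema }}
        {%- endif -%}
    {%- endmacro %}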

You can learn more about how to override the generate_schema_name function in the dbt documentation.

Staging

At Vimeo, we have a staging database, which is effectively a copy of the production database. We use this database to test the models at the final stage before production. When a developer opens a pull request, we leverage Jenkins to trigger a dbt job with the purpose of testing that dev’s changes end to end. Here’s a look at the commands we run in the staging tests:

  1. dbt run --select state:modified --full-refresh
  2. dbt run --select state:modified
  3. dbt test --select state:modified
  4. dbt snapshot --select state:modified

We first check that a model is capable of a full refresh, or backfill. We then check that it’s able to increment. To complete the checks, we make sure that the model’s tests pass and its snapshots run. The selector state:modified helps us keep these checks lightweight and fast: it compares the dbt manifest of the test branch against an up-to-date version of the project manifest, so only the modified models are built. This relationship is easily defined in the dbt job’s settings (see Figure 2).

Figure 2: This figure shows how to select which dbt Cloud job to use for state comparison. In our case, it is a job titled “compile production artifacts,” which builds an up-to-date version of the project manifest every 10 minutes.

If you’d like to learn more about the state selector, see the dbt documentation.

A note on incremental models and CI performance

At Vimeo, we are almost always working with large volumes of data. To create performant tables that can be updated quickly and that yield results in a reasonable amount of time, we prefer to use incremental models. This presents new challenges that make testing difficult, the most significant being: how do we test both backfilling and incrementing an incremental model? The solution is simply to limit the amount of data read during CI checks.

We expect developers to self-limit the size of their models when running in development and staging. Our dbt onboarding documentation encourages developers to include a data-limiting macro when developing incremental models with large volumes of data.
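In spirit, it’s a where-clause guard along the following lines; the production target name, the date column, and the three-day lookback window are placeholders rather than our exact values:

    {% if target.name != 'prod' %}
      -- Outside production, only scan a few days of data to keep runs fast.
      and event_date >= dateadd(day, -3, current_date())
    {% endif %}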

This significantly reduces the amount of data that is used in the full-refresh step of our checks and also speeds up development. If the developer doesn’t include this, their model will likely time out the checks, which shows as a failure on the PR.

Keeping an up-to-date manifest

When developers push code to the main branch between runs of our production job, a lag forms in what is recorded in the production manifest. The production job may have run 12 hours ago, but since then four developers have pushed changes to the main branch and a few others have branched from it. When those developers push changes and test their models, dbt thinks that the changes already merged to the main branch, in addition to their actual changes, need to be tested. This lag is expensive in test and compute time and weakens our developer experience.

To solve this, we introduced the dbt job you saw above, “compile production artifacts.” This job has only one step, dbt compile, and runs every 10 minutes. This produces a set of compiled SQL files as well as a manifest.json file directly from our main branch, representing the latest state. We then defer to this job for all tests, ensuring that developers are testing only the changes they pushed.

You can read more about dbt compile in the dbt documentation.

Triggering the dbt checks

I mentioned that we use Jenkins to detect PRs and run checks. It’s important to note that we have a custom solution, since Vimeo uses GitHub Enterprise, which isn’t directly supported by dbt Cloud and doesn’t support GitHub Actions at this time. There are a few short steps involved here:

  1. A new Jenkins multi-branch pipeline is created and pointed at the dbt repository.
  2. The developer opens a pull request.
  3. The developer comments “build PR” on their pull request, triggering a Jenkins run. This step is to prevent checks from running on each commit, which can be expensive.
  4. Jenkins uses a Kubernetes config to run a Python script.
  5. The Python script triggers the dbt job via API, specifying the branch of the PR as one of the arguments.

With those steps outlined, the Jenkinsfile itself stays small.
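A simplified sketch of such a pipeline, in which the pod template path, credential ID, and script flags are placeholders:

    // Jenkinsfile (sketch) -- pod template, credential ID, and flags are placeholders
    pipeline {
        agent {
            kubernetes {
                // Pod template that provides a Python environment for the trigger script.
                yamlFile 'ci/dbt-ci-pod.yaml'
            }
        }
        stages {
            stage('Trigger dbt Cloud CI job') {
                steps {
                    withCredentials([string(credentialsId: 'dbt-cloud-api-token', variable: 'DBT_CLOUD_API_TOKEN')]) {
                        // CHANGE_BRANCH is set by the multi-branch pipeline for pull requests.
                        sh 'python3 dbt_jenkins_script.py --branch "$CHANGE_BRANCH"'
                    }
                }
            }
        }
    }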

You may have noticed the dbt_jenkins_script.py file. This script, written by the dbt engineering team, triggers a dbt job and polls for its status.

Once the dbt job has completed, the developer receives a check on their PR indicating that their model passed successfully (see Figure 3). After they receive an approval from a member of the Data Engineering team, they are able to merge their PR. It will then get picked up by our scheduled production job (more on this in the next section).

Figure 3: When a developer has successfully passed the automated checks, a green check shows on their PR indicating that all checks have passed.

At this stage, it’s important to acknowledge how much time we have saved both maintainers and developers by introducing these checks into the development process. Rather than relying on the human-in-the-middle approach to testing, which was both error-prone and time-consuming, now both parties are confident that the model will materialize successfully in production and that there are no syntax errors. Instead of developers spending time deploying to a staging instance and waiting for Airflow to run, they are contributing code and receiving feedback from automated tests. Additionally, maintainers aren’t relying on developers to self-report the status of their data model and have direct access to the dbt logs through the pull request. We have achieved the major goals outlined at the start of this post!

Production

For production (see Figure 4), we wanted to start with a monolithic dbt run command but write the API functions such that we could reuse them to trigger other dbt jobs in the future. Our team discussed several options for organizing dbt runs at length and decided to keep the run monolithic so as not to abstract prematurely. This also made dbt easier to onboard and maintain, which worked well for our centralized team at the time.

Figure 4: The diagram shows the relationship between Airflow, dbt, and Snowflake. Airflow acts as the orchestrator, while dbt performs transformations on the data warehouse.

Currently we run a single dbt production job twice daily. The production job is triggered via API by an Airflow DAG, which is little more than a thin wrapper around two helper functions.
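A simplified sketch of that DAG; the DAG ID, schedule, package name, and job ID are placeholders:

    # dags/dbt_production.py (sketch) -- DAG id, schedule, and job id are placeholders
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Hypothetical internal package exposing the dbt Cloud helpers shown below.
    from vimeo_dbt_utils import run_dbt_job, get_run_status


    def trigger_and_wait():
        """Trigger the production dbt Cloud job and wait for it to finish."""
        run_id = run_dbt_job(job_id=12345)  # placeholder job id
        status = get_run_status(run_id)
        if status != "Success":
            raise RuntimeError(f"dbt Cloud run {run_id} finished with status {status}")


    with DAG(
        dag_id="dbt_production_run",
        start_date=datetime(2022, 1, 1),
        schedule_interval="0 */12 * * *",  # twice daily
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="run_dbt_production_job",
            python_callable=trigger_and_wait,
        )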

run_dbt_job and get_run_status are defined separately in one of our internal Python packages.
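A sketch of what such helpers look like against the dbt Cloud v2 API; the environment variable names and the polling interval are assumptions:

    # vimeo_dbt_utils.py (sketch) -- env var names and polling interval are assumptions
    import os
    import time

    import requests

    DBT_CLOUD_API = "https://cloud.getdbt.com/api/v2"
    ACCOUNT_ID = os.environ["DBT_CLOUD_ACCOUNT_ID"]
    HEADERS = {"Authorization": f"Token {os.environ['DBT_CLOUD_API_TOKEN']}"}

    # dbt Cloud run status codes, as documented in the dbt Cloud API reference.
    RUN_STATUSES = {1: "Queued", 2: "Starting", 3: "Running", 10: "Success", 20: "Error", 30: "Cancelled"}


    def run_dbt_job(job_id: int, cause: str = "Triggered by Airflow") -> int:
        """Trigger a dbt Cloud job and return the id of the resulting run."""
        resp = requests.post(
            f"{DBT_CLOUD_API}/accounts/{ACCOUNT_ID}/jobs/{job_id}/run/",
            headers=HEADERS,
            json={"cause": cause},
        )
        resp.raise_for_status()
        return resp.json()["data"]["id"]


    def get_run_status(run_id: int, poll_interval: int = 60) -> str:
        """Poll a dbt Cloud run until it reaches a terminal state and return that state."""
        while True:
            resp = requests.get(f"{DBT_CLOUD_API}/accounts/{ACCOUNT_ID}/runs/{run_id}/", headers=HEADERS)
            resp.raise_for_status()
            status = RUN_STATUSES[resp.json()["data"]["status"]]
            if status in ("Success", "Error", "Cancelled"):
                return status
            time.sleep(poll_interval)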

The Python functions and DAG borrow heavily from the script that we mentioned in the Staging section above.

Scaling

We intend to support models that run more frequently in the future. One dbt feature we are considering for this purpose is tags. This would enable us to tag some models as hourly or daily, and it would also enable us to separate models by business function or set up more complex dependencies if needed. We’re already using tags to separate incremental models that need to be backfilled using a larger warehouse before joining up with the rest of the production models.
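As a small example (the tag name here is hypothetical), a model opts into a tag in its config, and a dedicated job then selects on that tag:

    {{ config(tags=['hourly']) }}    -- set in the model file; the tag name is hypothetical

    dbt run --select tag:hourly      -- the command a dedicated hourly job would run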

In conclusion

This is how we’ve implemented isolated dbt environments at Vimeo. I hope that we’ve helped you understand the effort required to configure a similar set of environments at your company.

If you’re still looking for more information on dbt deployment and testing, there are many best practices outlined by the dbt team themselves. Check out What are the dbt commands you run in your production deployment of dbt? and How we sped up our CI runs by 10x using Slim CI. I also highly recommend joining the dbt Slack.

Join our team

Our team is always growing. Check out our open positions!
